Motivation: Precise analysis of genetic networks, gene function and transcriptional regulation requires accurate prediction of transcription factor (TF) binding to DNA. To calculate the matching score between an input sequence and a set of known TF binding sites, we use position weight matrices (PWMs) and Bucher's scoring method (Bucher, J. Mol. Biol., 212, 563-578, 1990). Since predicting TF binding sites from these scores requires cut-off values, we propose a robust algorithm for determining them.
Results: We generalize the concept of local overrepresentation in statistical terms, and propose a new algorithm that determines the cut-off value using the background rate estimated on non-promoters. The algorithm iteratively determines the parameters that separate instances into phenomenon-dependent and phenomenon-independent subsets. Our system also includes a method for re-estimating the cut-off values of TFs that mistakenly recognize regions preferred by other TFs. Our data source comprised 433 non-redundant vertebrate promoters, including viral promoters, from the Eukaryotic Promoter Database (EPD) R.50. The method was applied to the 205 vertebrate TFs that have frequency matrices in TRANSFAC Ver.3.4, and cut-off values could be determined for all of them.
Availability: The cut-off values and the TF binding site prediction tool are available at http://www.hgc.ims.u-tokyo.ac.jp/service/tooldoc/TFBIND. We also provide the programs for estimating cut-off values.
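
As an illustration of how a PWM matching score and a cut-off value work together, the following is a minimal Python sketch. It is not Bucher's exact formula and not the TFBIND implementation: it assumes a simple log-odds matrix with a pseudocount, a uniform background of 0.25 per base, and a score rescaled to the range [0, 1]. The function names (`log_odds_matrix`, `normalized_score`, `predict_sites`) and the toy count matrix are hypothetical and chosen only for this example.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log-odds weights.

    Assumption: uniform background and additive pseudocount; the paper's
    actual weighting follows Bucher (1990) and may differ.
    """
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def normalized_score(matrix, window):
    """Rescale the summed weights of a window to [0, 1]."""
    raw = sum(col[base] for col, base in zip(matrix, window))
    lo = sum(min(col.values()) for col in matrix)
    hi = sum(max(col.values()) for col in matrix)
    if hi == lo:  # degenerate matrix with no preference at any position
        return 0.0
    return (raw - lo) / (hi - lo)


def predict_sites(matrix, sequence, cutoff):
    """Report every window whose normalized score reaches the cut-off."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = normalized_score(matrix, window)
        if score >= cutoff:
            hits.append((start, window, score))
    return hits


# Toy 4-column count matrix that strongly prefers the word TATA (made-up data).
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
toy_matrix = log_odds_matrix(toy_counts)
toy_hits = predict_sites(toy_matrix, "GGTATACC", cutoff=0.9)
```

In this sketch the cut-off is simply passed in as a parameter. Lowering it reports more candidate sites and raising it reports fewer, which is why a principled choice of cut-off per TF, such as the background-rate-based procedure summarized in the Results, matters for prediction quality.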