We present a new algorithm called PromoterInspector to locate eukaryotic polymerase II promoter regions in large genomic sequences with a high degree of specificity. PromoterInspector focuses on the genetic context of promoters, rather than their exact location. Application of PromoterInspector can serve as a crucial pre-processing step for other methods that locate promoters exactly or analyze them. PromoterInspector does not depend on heuristics, because it is purely based on libraries of IUPAC words extracted from training sequences by an unsupervised learning approach. We compared PromoterInspector with other in silico promoter prediction tools using the sequences from the review by J.W. Fickett. PromoterInspector compared favourably on Fickett's evaluation scheme. A true positive to false positive ratio of 2.3 was obtained, surpassing the best ratio of 0.6, reported for TSSG. The application of our method to several large genomic sequences of over 1.3 million base-pairs in total resulted in even more specific predictions. The coverage of annotated promoters was comparable to other in silico promoter prediction methods, while the true positive predictions increased by up to 100% of total matches. PromoterInspector scans 100 kb in less than one minute on a workstation, and is thus especially applicable to large genome analysis. The method is available at http://genomatix.gsf.de/cgi-bin/promoterinspector/promoterinspector.pl.
Copyright 2000 Academic Press.