BMC Bioinformatics. 2005 Mar 30;6:79.
doi: 10.1186/1471-2105-6-79.

MAPPER: A Search Engine for the Computational Identification of Putative Transcription Factor Binding Sites in Multiple Genomes

Free PMC article

Voichita D Marinescu et al. BMC Bioinformatics. 2005.

Abstract

Background: Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity.

Results: We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several test cases, the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method.
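To make the idea of building a model from aligned sites and scanning a sequence with it concrete, the following is a minimal sketch, not the MAPPER implementation. It builds per-column log-odds scores from an ungapped alignment, which corresponds to the match-state-only special case of a profile hidden Markov model and is essentially the nucleotide weight matrix baseline mentioned above; a full profile HMM would also model insertions and deletions. The example sites, the uniform background frequencies, the pseudocount and the threshold are all illustrative assumptions.

```python
import math

# Assumed uniform background nucleotide frequencies (illustrative only).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_model(alignment, pseudocount=1.0):
    """Return a list of per-column log-odds dicts from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(site) != length for site in alignment):
        raise ValueError("all aligned sites must have the same length")
    model = []
    for col in range(length):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in alignment:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        model.append(
            {b: math.log2((counts[b] / total) / BACKGROUND[b]) for b in "ACGT"}
        )
    return model


def scan(sequence, model, threshold):
    """Return (start, window, score) for each window scoring >= threshold."""
    sequence = sequence.upper()
    width = len(model)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(column[base] for column, base in zip(model, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned sites and query sequence, made up for this sketch.
sites = ["TTTCGCGC", "TTTGGCGC", "TTTCCCGC", "TTTGGCGG"]
model = build_model(sites)
hits = scan("AAAATTTCGCGCAAAA", model, threshold=8.0)
```

Only the forward strand is scanned here; a practical scanner would also score the reverse complement and would calibrate the threshold per model rather than using a fixed value.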

Conclusion: The search engine, available at http://mapper.chip.org, allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition, it allows the user to upload a query sequence and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Owing to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.

Figures

Figure 1
Quality measures for the alignments retrieved. A. Distribution of the parameters characterizing the model (length, number of sequences and the size of the nucleotide matrix used to train the model). B. Distribution of the median and average quality of the nucleotide sequences used to build the alignments for the TRANSFAC factor-derived models. The quality variable is categorical and represents "1 – functionally confirmed factor binding site; 2 – binding of pure protein purified or recombinant, 3 – immunologically characterized binding activity of a cellular extract, 4 – binding activity characterized via a known binding sequence, 5 – binding of uncharacterized extract protein to a bona fide element, 6 – no quality assigned" (cf. TRANSFAC documentation).
Figure 2
The selection page of the search engine. The selection page for the MCM5 gene displays detailed information on the gene and its homologs available in our database, and allows the user to select the gene region to be scanned. The same region will be scanned for all homologs included in the search.
Figure 3
The output of the query for the human MCM5 gene. The output was edited to highlight the E2F binding sites discussed in the text. The hit alignment window shows the match between the sequence at positions +2 to +12 from the transcript start and model T05206. The set of hits can be sorted by position, name or accession number of the factor. The position of the hits can be displayed with respect to the start of the transcript, the ATG or as absolute coordinates on the chromosome. The page can display the list of common factors that bind to the same selected region in the homologs included in the analysis, the factors on the list that are known to physically interact or the different classes to which they belong. In addition, the hits occurring in evolutionarily conserved regions can be highlighted.
Figure 4
Different representations of the set of putative TFBSs in the human MCM5 gene promoter. A. Graphical representation of the hit set presented in Figure 3. B. The hit set was exported to the UCSC Human Genome Browser as a custom track. The region displayed in this image extends to 500 bp upstream of the coding sequence start. Note that the clusters of predicted binding sites correspond to peaks in the human/mouse conservation track at the bottom, suggesting that those regions are functional. The positions of the most conserved elements displayed in the conservation track are the ones used in the previous page to highlight hits in evolutionarily conserved regions (see Methods for details).
Figure 5
The page for model T05206 for E2F-4:DP-1. The model page displays detailed information regarding the model including the name and (if available) organism and classification of the factor, the model length, the number of sequences in the alignment used to train the model and the references used to select these sequences. The page also displays the HMM logo generated using the LogoMat-M software [89].
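The custom-track export described for Figure 4B can be illustrated with a short sketch. The source does not say which file format MAPPER emits; BED is one format the UCSC Genome Browser accepts for custom tracks, so it is assumed here. The function name `to_bed_track`, the track name and the hit coordinates are hypothetical and chosen only for illustration.

```python
def to_bed_track(hits, track_name="putative_TFBS",
                 description="Putative TFBS hits"):
    """Format hits as a six-column BED custom track and return it as text.

    Each hit is (chrom, start, end, name, score, strand), with start and end
    already in BED's 0-based, half-open coordinate convention.
    """
    lines = [
        f'track name="{track_name}" description="{description}" useScore=1'
    ]
    for chrom, start, end, name, score, strand in hits:
        # BED scores are integers from 0 to 1000, so clamp and round.
        bed_score = max(0, min(1000, int(round(score))))
        lines.append(f"{chrom}\t{start}\t{end}\t{name}\t{bed_score}\t{strand}")
    return "\n".join(lines) + "\n"


# Made-up coordinates; these are not real positions of any MCM5 binding site.
example_hits = [("chr1", 100, 111, "T05206", 850.4, "+")]
track_text = to_bed_track(example_hits)
```

The resulting text can be pasted into the browser's custom track form or saved to a file. Model scores on another scale would need to be rescaled to the 0 to 1000 range before export.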

References

    1. Ghazi A, VijayRaghavan KV. Developmental biology. Control by combinatorial codes. Nature. 2000;408:419–420. doi: 10.1038/35044174.
    2. Bulyk ML. Computational prediction of transcription-factor binding site locations. Genome Biol. 2003;5:201. doi: 10.1186/gb-2003-5-1-201.
    3. Qiu P. Recent advances in computational promoter analysis in understanding the transcriptional regulatory network. Biochem Biophys Res Commun. 2003;309:495–501. doi: 10.1016/j.bbrc.2003.08.052.
    4. Pennacchio LA, Rubin EM. Comparative genomic tools and databases: providing insights into the human genome. J Clin Invest. 2003;111:1099–1106. doi: 10.1172/JCI200317842.
    5. Pennacchio LA, Rubin EM. Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet. 2001;2:100–109. doi: 10.1038/35052548.
