Nucleic Acids Res. 2013 Jan;41(Database issue):D195-202.
doi: 10.1093/nar/gks1089. Epub 2012 Nov 21.

HOCOMOCO: A Comprehensive Collection of Human Transcription Factor Binding Sites Models

Free PMC article


Ivan V Kulakovskiy et al. Nucleic Acids Res.


Transcription factor (TF) binding site (TFBS) models are crucial for computational reconstruction of transcription regulatory networks. In existing repositories, a TF often has several models (also called binding profiles or motifs) obtained from different experimental data. For practical applications, a single TFBS model per TF is more convenient. We show that integrating TFBS data from various types of experiments into a single model typically improves model quality, probably due to partial correction of source-specific technique bias. We present the Homo sapiens comprehensive model collection (HOCOMOCO), containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including a newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source.
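As a rough illustration of what a position weight matrix is, the sketch below builds a log-odds matrix from a handful of aligned binding sites and scans a sequence for its best-scoring window. This is not ChIPMunk and not the procedure used to build HOCOMOCO; the function names `build_pwm` and `best_hit`, the pseudocount of 1, the uniform background of 0.25 and the example sites are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Illustrative only: a plain count matrix with a flat pseudocount and a
    uniform background, not the weighting scheme of any specific tool.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no log2(0) can occur.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2(counts[base] / total / background) for base in BASES}
        )
    return pwm


def best_hit(pwm, sequence):
    """Return (score, offset) of the best-scoring window on the given strand."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    hits = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    # Compare on score only; the first window wins on ties.
    return max(hits, key=lambda hit: hit[0])


# Hypothetical aligned sites, invented for this example.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
score, offset = best_hit(pwm, "GGGTGACTCAGGG")
```

A real scan would also score the reverse complement and compare the score against a threshold; both steps are left out here to keep the sketch short.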


Figure 1.
Comparison of AUC ratios for TFBS models from JASPAR (green bars), TRANSFAC (red curve) and HOCOMOCO (blue curve). A value of 1 corresponds to the best model, that is, the one with the highest AUC value. Points on the X-axis correspond to control sets for different TFs. The Y-axis shows AUC ratios. If a collection contained several TFBS models for a TF, the best result is shown. Details are given in the text.
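One possible reading of the AUC ratio described in this caption can be sketched as follows: for a given TF, take the best AUC within each collection, then divide by the highest of those values so that the best collection scores 1. The exact definition is in the article text, which is not reproduced here, so this interpretation, the function names `auc` and `auc_ratios`, and the numbers are assumptions.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for pos in pos_scores:
        for neg in neg_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_ratios(auc_by_collection):
    """Scale each collection's best AUC by the overall best AUC for one TF.

    auc_by_collection maps a collection name to the AUC values of all of
    its models for that TF; collections with no model are skipped.
    """
    best = {
        name: max(values)
        for name, values in auc_by_collection.items()
        if values
    }
    top = max(best.values())
    return {name: value / top for name, value in best.items()}


# Hypothetical scores and AUC values, invented for this example.
example_auc = auc([0.9, 0.8], [0.1, 0.85])
ratios = auc_ratios({"collection_a": [0.8, 0.9], "collection_b": [0.6]})
```

Under this reading, a collection with several models for a TF contributes only its best one, matching the caption's note that the best result is shown.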
Figure 2.
TFBS model LOGOs for highly similar models within TF families. LOGOs for selected members of the CEBP, E2F and SP families are given. The Discrete Information Content is used for nucleotide scaling as in (29). Note that in our LOGO representation the dominant nucleotides are placed at the bottom, which makes it easy to read off the sequence of the best-scoring binding site.



