Nucleic Acids Res. 2013 Nov;41(21):e197.
doi: 10.1093/nar/gkt831. Epub 2013 Sep 20.

A General Approach for Discriminative De Novo Motif Discovery From High-Throughput Data

Free PMC article

Jan Grau et al. Nucleic Acids Res.

Abstract

De novo motif discovery has been an important challenge of bioinformatics for the past two decades. Since the emergence of high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding microarrays (PBMs), the focus of de novo motif discovery has shifted to runtime and accuracy on large data sets. For this purpose, specialized algorithms have been designed for discovering motifs in ChIP-seq or PBM data. However, none of the existing approaches works perfectly for all three high-throughput techniques. In this article, we propose Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data. We demonstrate that Dimont yields a higher number of correct motifs from ChIP-seq data than any of the specialized approaches and achieves a higher accuracy for predicting PBM intensities from probe sequence than any of the approaches specifically designed for that purpose. Dimont also reports the expected motifs for several ChIP-exo data sets. Investigating differences between in vitro and in vivo binding, we find that for most transcription factors, the motifs discovered by Dimont are in good accordance between techniques, but we also find notable exceptions. We also observe that modeling intra-motif dependencies may increase accuracy, which indicates that more complex motif models are a worthwhile field of research.

Figures

Figure 1.
Normalized likelihood profile of a sequence. The red dashed line visualizes the threshold that is used to accelerate the algorithm. All positions with peaks above the threshold are retained, and all remaining positions are not used for evaluating the likelihood.
Figure 2.
Runtime evaluation of Dimont on the data sets used in this article. We consider all ChIP-seq data sets (blue), ChIP-exo (red) and PBM (green) data sets used in this article. Upright triangles represent the runtime without the speed-up strategy, whereas reversed triangles represent the runtime using the speed-up strategy. Runtime decreases by a factor of 5 to 29 due to the speed-up strategy.
Figure 3.
Three exemplary motifs discovered by Dimont on the FoxA2, Tcfcp2l1 and KNI data sets of Ma et al. (4) compared with the corresponding motifs from the Jaspar database.
Figure 4.
Motifs discovered by Dimont on three of the yeast ChIP-exo data sets of Rhee and Pugh (2) compared with the corresponding motifs from the Jaspar database.
Figure 5.
Motifs discovered by Dimont on the ChIP-seq and ChIP-exo data sets of the human insulator CTCF compared with the CTCF motif from the Jaspar database.
Figure 6.
Influence of the choice of background order for different motif orders and weighting factor on the performance on the tuning data sets of the DREAM5 challenge. In the first row, we plot performance against background order for motif orders 0, 1 and 2 and a fixed weighting factor of 0.01. In the second row, we plot performance against weighting factors for a uniform background model and background orders 0 to 5, given a fixed motif order of 1.
Figure 7.
Comparison of the motifs discovered by Dimont using PBM and ChIP-seq or ChIP-exo data. For Esrrb, Foxo1, Gata4 and Zfx, we obtain largely similar motifs for PBM and ChIP-seq/ChIP-exo data, whereas we find minor differences for Nr5a2, Phd1, Rap1 and Tcf3. In the case of Tbx5 and Tbx20, the motifs discovered from PBM and ChIP-seq data differ substantially.
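
The thresholding idea in the Figure 1 caption can be illustrated with a minimal sketch. It is not Dimont's implementation: the function names, the sum-to-one normalization, the example values and the threshold of 0.2 are all assumptions made for illustration. The sketch only shows the general pattern of keeping positions whose normalized likelihood exceeds a threshold and skipping the rest.

```python
from typing import List


def normalized_profile(likelihoods: List[float]) -> List[float]:
    """Scale per-position likelihoods so they sum to 1.

    Assumption: a simple sum-to-one normalization; the paper's exact
    normalization is not given in this text.
    """
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("likelihoods must have a positive sum")
    return [value / total for value in likelihoods]


def positions_above_threshold(likelihoods: List[float], threshold: float) -> List[int]:
    """Return indices whose normalized likelihood exceeds the threshold.

    Hypothetical helper: only these positions would be evaluated in
    later likelihood computations; all others would be skipped.
    """
    profile = normalized_profile(likelihoods)
    return [i for i, value in enumerate(profile) if value > threshold]


# Made-up per-position likelihoods for one sequence (illustrative only).
example = [0.1, 0.1, 3.0, 0.2, 0.1, 1.5]
kept = positions_above_threshold(example, threshold=0.2)
print(kept)
```

In this toy profile only two of six positions survive, which hints at why restricting the evaluation to high-likelihood positions could reduce runtime.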

References

    1. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316:1497–1502.
    2. Rhee HS, Pugh BF. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011;147:1408–1419.
    3. Berger MF, Bulyk ML. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat. Protoc. 2009;4:393–411.
    4. Ma X, Kulkarni A, Zhang Z, Xuan Z, Serfling R, Zhang MQ. A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information. Nucleic Acids Res. 2012;40:e50.
    5. Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010;26:2622–2623.
