, 8, 385

Discriminative Motif Discovery in DNA and Protein Sequences Using the DEME Algorithm

Emma Redhead et al. BMC Bioinformatics.

Abstract

Background: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms.

Results: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. Input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins.

Conclusion: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/

Figures

Figure 1
The DEME data model. Labels on arcs show the probabilities of choosing the labelled class, C, of a sequence, and the true class, T. When T = 0, sequences are generated using just the background model, θB. When T = 1, sequences contain a motif site, generated by motif model θM, inserted in random sequence generated by θB.
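The generative process in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_sequence`, the dictionary `p_true_given_label` (standing in for the arc probabilities, whose values are not given here), and the use of an independent-letter background for θB are all assumptions made for illustration.

```python
import random

DNA = "ACGT"

def sample_sequence(length, theta_b, theta_m, p_true_given_label, label, rng):
    """Sample one sequence and its true class T, given its labelled class C.

    theta_b: background letter probabilities (assumed 0-order model).
    theta_m: list of motif columns, each a list of letter probabilities.
    p_true_given_label: hypothetical P(T = 1 | C) for C in {0, 1}.
    """
    width = len(theta_m)
    if length < width:
        raise ValueError("sequence must be at least as long as the motif")
    # Draw the true class T from the (assumed) arc probability for this label.
    t = 1 if rng.random() < p_true_given_label[label] else 0
    if t == 0:
        # T = 0: the whole sequence comes from the background model.
        return "".join(rng.choices(DNA, weights=theta_b, k=length)), t
    # T = 1: generate one motif site column by column from theta_m ...
    site = [rng.choices(DNA, weights=col, k=1)[0] for col in theta_m]
    # ... and insert it at a random position in background sequence.
    flank = rng.choices(DNA, weights=theta_b, k=length - width)
    start = rng.randrange(len(flank) + 1)
    return "".join(flank[:start] + site + flank[start:]), t
```

Calling `sample_sequence(20, [0.25] * 4, theta_m, {0: 0.1, 1: 0.9}, 1, random.Random(0))` would return a length-20 sequence together with its hidden class; the 0.1 and 0.9 values are placeholders, not figures from the paper.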
Figure 2
Effect of the Bayesian motif prior on local search accuracy. The plot shows the average accuracy of the motif models discovered by conjugate gradient alone as a function of A, the total pseudocounts applied when deriving the PSFM from W. The starting point for conjugate gradient is derived from the consensus sequence for the planted motif using a value of B = 0.25. All experiments use the Random Negative Problem and DNA sequences and the OOPS data model. Each data point is the arithmetic mean (± standard error) for 100 independent experiments. Panel a shows results using FM motifs and panel b shows results using PSFM motifs.
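As a rough illustration of how a total pseudocount such as A affects a frequency matrix, the sketch below applies generic pseudocount smoothing to a matrix of per-column letter counts. This is an assumption-laden example rather than DEME's procedure: the caption does not define W or how A and B enter the calculation, so the function `counts_to_psfm`, its `counts` input, and the background-proportional spreading of the pseudocount are hypothetical.

```python
def counts_to_psfm(counts, total_pseudocount, background):
    """Turn per-column letter counts into frequencies with pseudocounts.

    Each column receives `total_pseudocount` extra counts, spread over
    the letters in proportion to `background` (an assumed convention).
    """
    psfm = []
    for column in counts:
        denom = sum(column) + total_pseudocount
        if denom <= 0:
            raise ValueError("column has no counts and no pseudocounts")
        psfm.append([
            (c + total_pseudocount * b) / denom
            for c, b in zip(column, background)
        ])
    return psfm
```

Under this convention, a larger total pseudocount pulls every column toward the background distribution, while a value of zero leaves the raw frequencies unchanged.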
Figure 3
Effect of the seed prior on DEME accuracy. The plots show the average accuracy of the motifs discovered by DEME on different synthetic discriminative problems as a function of the size of the seed prior, B. The Bayesian motif prior is set to A = 4 for FM problems, and A = 1 for PSFM problems. All results are for DNA sequences, and DEME uses the OOPS data model in all cases. Each data point is the arithmetic mean ± standard error for 100 independent experiments.
Figure 4
Comparison of DEME and MEME on synthetic problems. Each plot shows the accuracy of predicted motifs as measured by the training set PC. Each data point represents the mean (± standard error) PC on 100 independent instantiations of the given problem. Panel a shows results on the FM Random Negative Problem. Panel b shows results for the FM Decoy Motif Problem with zero mutations in the occurrences of the decoy motif as a function of the number of mutations in the target motif sites. Panel c shows results on the FM Variant Motif Problem. The variant motif is Hamming distance four from the target motif and planted instances of the target and variant motifs contain the same number of mutations. Panel d shows results on the width-10 PSFM Impoverished Negative Problem (and width-10 PSFM Random Negative Problem for comparison) as a function of the length of the sequences. In all tests, DEME is run using a positive and negative training set, while MEME is applied to the positive training set only.
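The Variant Motif Problem in panel c relies on Hamming distance between motifs and on planting sites with a fixed number of mutations. A minimal Python sketch of both ideas follows; the helper names `hamming` and `mutate` and the example target string are hypothetical, and the paper's actual procedure for generating its synthetic data may differ.

```python
import random

DNA = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def mutate(motif, n_mutations, rng):
    """Return a copy of motif with exactly n_mutations positions changed."""
    letters = list(motif)
    # Pick distinct positions so each mutation changes a different column.
    for pos in rng.sample(range(len(letters)), n_mutations):
        # Substitute a letter that differs from the current one.
        letters[pos] = rng.choice([c for c in DNA if c != letters[pos]])
    return "".join(letters)

rng = random.Random(1)
target = "ACGTACGTAC"             # hypothetical width-10 target motif
variant = mutate(target, 4, rng)  # a variant at Hamming distance four
site = mutate(target, 2, rng)     # a planted target site with two mutations
```

Because `mutate` changes distinct positions and always substitutes a different letter, `hamming(target, variant)` is exactly 4 by construction, which matches the distance between target and variant motifs described in the caption.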
Figure 5
Discriminative motifs for thermophilic vs. mesophilic TATA-box proteins. Each column shows the aligned LOGOs from a single experiment. Column a shows the motif found by DEME using the thermophilic proteins as the positive set. Column b shows the DEME motif when the mesophilic proteins are used as the positive set. In each case, the upper LOGO illustrates the residue preferences in the motif sites reported by DEME in the thermophilic sequences.
Figure 6
Running time of the DEME algorithm. The plot shows the CPU time required by DEME on a typical PC using a typical ChIP-chip (positive) dataset, as a function of the number of non-binding probe sequences used as the negative set. The positive set contains 59 probe sequences with an average length of 564 nt that bind GCN4 in rich media (the ChIP-chip datasets compiled by Harbison et al. [5] contain on average 40 probe sequences of length 564 nt). Each negative dataset contains randomly selected non-binding probe sequences. The largest negative set studied contains all non-binding probe sequences.


References

    1. Tompa M, Li N, Bailey TL, Church GM, Moor BD, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–144.
    2. Hu JJ, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Research. 2005;33:4899–4913.
    3. Fang J, Haasl RJ, Dong Y, Lushington GH. Discover protein sequence signatures from protein-protein interaction data. BMC Bioinformatics. 2005;6:277.
    4. Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol. 2002;20:835–839.
    5. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431:99–104.
