GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery

Leping Li

doi:10.1089/cmb.2008.16TT

J Comput Biol. 2009 Feb;16(2):317-29. doi: 10.1089/cmb.2008.16TT.

Author

Leping Li¹

Affiliation

¹ Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA. li3@niehs.nih.gov

Abstract

Genome-wide analyses of protein binding sites generate large amounts of data; a ChIP dataset might contain 10,000 sites. Unbiased motif discovery in such datasets is not generally feasible using current methods that employ probabilistic models. We propose an efficient method, GADEM, which combines spaced dyads and an expectation-maximization (EM) algorithm. Candidate words (four to six nucleotides) for constructing spaced dyads are prioritized by their degree of overrepresentation in the input sequence data. Spaced dyads are converted into starting position weight matrices (PWMs). GADEM then employs a genetic algorithm (GA), with an embedded EM algorithm to improve starting PWMs, to guide the evolution of a population of spaced dyads toward one whose entropy scores are more statistically significant. Spaced dyads whose entropy scores reach a pre-specified significance threshold are declared motifs. GADEM performed comparably with MEME on 500 sets of simulated "ChIP" sequences with embedded known P53 binding sites. The major advantage of GADEM over competing methods is its computational efficiency on large ChIP datasets. We applied GADEM to six genome-wide ChIP datasets. Approximately 15 to 30 motifs of various lengths were identified in each dataset. Remarkably, without any prior motif information, the expected known motif (e.g., P53 in P53 data) was identified every time. GADEM discovered motifs of various lengths (6-40 bp) and characteristics in these datasets, which contained from 0.5 to >13 million nucleotides, with run times of 5 to 96 h. GADEM can be viewed as an extension of the well-known MEME algorithm and is an efficient tool for de novo motif discovery in large-scale genome-wide data. The GADEM software is available at www.niehs.nih.gov/research/resources/software/GADEM/.
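To make the first two steps of the abstract concrete (collecting candidate words and turning a spaced dyad into a starting PWM), the following is a minimal Python sketch and not the GADEM implementation. The function names `count_words` and `dyad_to_pwm`, the `core_weight` value of 0.85, and the toy sequences are all assumptions introduced for illustration. Raw word counts stand in for the overrepresentation measure, whose exact form is not given in the abstract, and the GA, EM refinement, and entropy-score significance test are omitted.

```python
from collections import Counter

BASES = "ACGT"


def count_words(sequences, k):
    """Count every k-mer (forward strand only) across the input sequences.

    Raw counts are a crude stand-in for an overrepresentation measure;
    the abstract does not specify how overrepresentation is computed.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing ambiguous symbols such as N.
            if set(word) <= set(BASES):
                counts[word] += 1
    return counts


def dyad_to_pwm(word1, spacer_len, word2, core_weight=0.85):
    """Build a starting PWM from the spaced dyad word1-N{spacer_len}-word2.

    Each word position puts `core_weight` (a hypothetical value) on the
    observed base and splits the remainder evenly over the other three.
    Spacer positions are uniform. Returns a list of per-column dicts.
    """
    if not 0.25 <= core_weight <= 1.0:
        raise ValueError("core_weight must lie in [0.25, 1.0]")
    other = (1.0 - core_weight) / 3.0

    def word_columns(word):
        return [
            {b: (core_weight if b == base else other) for b in BASES}
            for base in word.upper()
        ]

    spacer = [{b: 0.25 for b in BASES} for _ in range(spacer_len)]
    return word_columns(word1) + spacer + word_columns(word2)


# Toy usage: pick the two most frequent 4-mers and join them with a 3-nt spacer.
toy_sequences = ["ACGTACGTTTGCATGCA", "GGACGTAAATGCATT"]
ranked = count_words(toy_sequences, 4).most_common(2)
first_word, second_word = ranked[0][0], ranked[1][0]
starting_pwm = dyad_to_pwm(first_word, 3, second_word)
print(first_word, second_word, len(starting_pwm))
```

In a pipeline of the kind the abstract describes, a matrix like `starting_pwm` would then be refined by EM and scored, with the GA varying the word pair and spacer length across generations.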

Publication types

Research Support, N.I.H., Extramural
Research Support, N.I.H., Intramural

MeSH terms

Algorithms*
Amino Acid Motifs / genetics*
Base Sequence
Cell Line
Chromatin Immunoprecipitation
Computational Biology / methods
Databases, Genetic*
Genome
Humans
Models, Genetic*
Models, Statistical
Molecular Sequence Data
Sequence Homology, Amino Acid
Software