Motif Yggdrasil: sampling sequence motifs from a tree mixture model

J Comput Biol. 2007 Jun;14(5):682-97. doi: 10.1089/cmb.2007.R010.

Abstract

In phylogenetic footprinting, putative regulatory elements are found in the upstream regions of orthologous genes by searching for common motifs. Motifs in different upstream sequences are subject to mutations along the edges of the corresponding phylogenetic tree; consequently, taking advantage of the tree in the motif search is an appealing idea. We describe the Motif Yggdrasil sampler, the first Gibbs sampler based on a general tree that uses unaligned sequences. Previous tree-based Gibbs samplers have assumed a star-shaped tree or partially aligned upstream regions. We give a probabilistic model (the MY model) describing upstream sequences with regulatory elements and build a Gibbs sampler with respect to this model. The model allows toggling, i.e., the restriction of a position to a subset of nucleotides, but requires neither aligned sequences nor edge lengths, which may be difficult to come by. We apply the collapsing technique to eliminate the need to sample nuisance parameters, and give a derivation of the predictive update formula. We show that the MY model improves the modeling of difficult motif instances and that the use of the tree achieves a substantial increase in the nucleotide-level correlation coefficient, both for synthetic data and for 37 bacterial lexA genes. We investigate the sensitivity to errors in the tree and show that, using random trees, the MY sampler still has a performance similar to the original version.
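
The abstract reports results as a nucleotide-level correlation coefficient. As a rough illustration only, the sketch below computes that quantity in its commonly used form, (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)), counted over individual nucleotide positions. The function name `nucleotide_cc`, its arguments, and the convention of returning 0.0 when the denominator vanishes are assumptions made for this example; the abstract does not state how the authors implemented the evaluation.

```python
from math import sqrt


def nucleotide_cc(true_sites, predicted_sites, seq_lengths):
    """Nucleotide-level correlation coefficient (illustrative sketch).

    true_sites / predicted_sites: one collection of 0-based motif positions
    per sequence (hypothetical input format, assumed to lie within the
    sequence). seq_lengths: length of each sequence.
    """
    tp = fp = fn = tn = 0
    for truth, pred, length in zip(true_sites, predicted_sites, seq_lengths):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)           # annotated and predicted
        fp += len(pred - truth)           # predicted only
        fn += len(truth - pred)           # annotated only
        tn += length - len(truth | pred)  # neither annotated nor predicted
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# One 100-nt sequence with a 10-nt site; the prediction is shifted by 5 nt.
score = nucleotide_cc([range(10, 20)], [range(15, 25)], [100])
```

A perfect prediction gives a coefficient of 1, and a prediction that overlaps the true site only partially is penalized for both its false positives and its false negatives.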

MeSH terms

  • Bacterial Proteins / genetics
  • Binding Sites / genetics
  • DNA Footprinting / methods*
  • DNA Footprinting / trends
  • Escherichia coli Proteins / genetics
  • Markov Chains
  • Models, Genetic*
  • Phylogeny*
  • Regulatory Elements, Transcriptional / genetics*
  • Sequence Alignment / methods
  • Sequence Analysis, DNA / methods
  • Sequence Analysis, DNA / trends
  • Serine Endopeptidases / genetics
  • Transcription Factors / metabolism*

Substances

  • Bacterial Proteins
  • Escherichia coli Proteins
  • LexA protein, Bacteria
  • Transcription Factors
  • Serine Endopeptidases