MotifOrganizer: a scalable model-based motif clustering tool for mammalian genomes

Front Biosci (Elite Ed). 2013 Jan 1;5(2):785-97. doi: 10.2741/e659.

Abstract

Assembling a comprehensive catalog of all transcription factors (TFs) and the genes that they regulate (their regulons) is important for understanding gene regulation. The sequence-specific conserved binding profiles of TFs can be characterized from whole-genome sequences with phylogenetic approaches, and a large number of such profiles have been released. Effective mining of these data sources could reveal novel functional elements computationally. Because binding sites are variable, profiles pertinent to the same TF need to be generalized by clustering. The summarized familial profile is effective in identifying unknown binding sites and thus in predicting gene co-regulation. Here we report MotifOrganizer, a scalable model-based clustering algorithm designed for grouping motifs identified from large-scale comparative genomics studies on mammalian species. The new algorithm allows grouping of motifs with variable widths, and a novel two-stage operation scheme further increases its scalability. MotifOrganizer demonstrated favorable performance compared with distance-based and single-stage model-based clustering tools on simulated data. Tests on approximately 150k motifs from the cisRED human database demonstrated that MotifOrganizer can effectively cluster whole-genome sets of mammalian motifs.
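
The task described above, grouping motifs of different widths and summarizing them as a familial profile, can be illustrated with a minimal Python sketch. It is not the MotifOrganizer algorithm, which is model-based and not specified in this abstract; it is a simple distance-based stand-in. The function names (`one_hot`, `column_distance`, `best_alignment`, `merge_motifs`), the squared Euclidean column distance, and the `min_overlap` parameter are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the MotifOrganizer algorithm.
# A motif is a list of columns; each column holds probabilities for A, C, G, T.


def one_hot(base, strength=0.97):
    """Hypothetical helper: a column strongly favoring one base."""
    rest = (1 - strength) / 3
    return [strength if b == base else rest for b in "ACGT"]


def column_distance(p, q):
    """Squared Euclidean distance between two probability columns."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def best_alignment(m1, m2, min_overlap=3):
    """Slide m2 along m1 and return (mean column distance, offset) for the
    best ungapped overlap, or None if no overlap reaches min_overlap columns.

    offset is the index in m1 where column 0 of m2 is placed (may be negative).
    """
    best = None
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        pairs = [
            (m1[i], m2[i - offset])
            for i in range(max(0, offset), min(len(m1), offset + len(m2)))
        ]
        if len(pairs) < min_overlap:
            continue
        score = sum(column_distance(p, q) for p, q in pairs) / len(pairs)
        if best is None or score < best[0]:
            best = (score, offset)
    return best


def merge_motifs(m1, m2, offset):
    """Build a summary profile: average overlapping columns and keep
    flanking columns from whichever motif covers them."""
    start = min(0, offset)
    end = max(len(m1), offset + len(m2))
    merged = []
    for pos in range(start, end):
        cols = []
        if 0 <= pos < len(m1):
            cols.append(m1[pos])
        if 0 <= pos - offset < len(m2):
            cols.append(m2[pos - offset])
        merged.append([sum(vals) / len(cols) for vals in zip(*cols)])
    return merged


# Two motifs of different widths that share the core ACGT.
wide = [one_hot(b) for b in "TACGTA"]
narrow = [one_hot(b) for b in "ACGT"]
score, offset = best_alignment(wide, narrow)
family_profile = merge_motifs(wide, narrow, offset)
```

In a distance-based clustering tool of this general kind, the alignment score would decide which motifs are grouped together and the merged profile would represent the group. A model-based method would replace the ad hoc distance with a likelihood under a probabilistic model.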

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Animals
  • Binding Sites / genetics*
  • Cluster Analysis
  • Genome / genetics*
  • Genomics / methods*
  • Humans
  • Mammals / genetics*
  • Models, Genetic
  • Software*
  • Transcription Factors / genetics*

Substances

  • Transcription Factors