Identifying cis-regulatory sequences by word profile similarity

Garmay Leung; Michael B Eisen

doi:10.1371/journal.pone.0006901

PLoS One. 2009 Sep 4;4(9):e6901. doi: 10.1371/journal.pone.0006901.

Authors

Garmay Leung¹, Michael B Eisen

Affiliation

¹ University of California Berkeley and University of California San Francisco Joint Graduate Group in Bioengineering, University of California, Berkeley, California, United States of America. garmay@berkeley.edu

Abstract

Background: Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples.

Methodology/principal findings: We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila.
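To make the idea of compositional similarity concrete, the following is a minimal sketch, not the WPH-finder implementation. It assumes a word profile is a count of overlapping k-mers and that profiles are compared by cosine similarity; the abstract does not specify the word length, the similarity measure, or the window scheme, so `k`, `step`, and the function names `word_profile`, `cosine_similarity`, and `scan` are all illustrative choices.

```python
from collections import Counter
from math import sqrt


def word_profile(seq, k=4):
    """Count overlapping words (k-mers) in a sequence. k=4 is an assumed default."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(p, q):
    """Cosine similarity between two word-count profiles (an assumed measure)."""
    dot = sum(count * q[word] for word, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def scan(genome, known, k=4, step=10):
    """Slide a window the length of `known` along `genome` and score each
    window by profile similarity to `known`. Returns (start, score) pairs,
    best score first."""
    target = word_profile(known, k)
    width = len(known)
    hits = []
    for start in range(0, len(genome) - width + 1, step):
        window = genome[start:start + width]
        hits.append((start, cosine_similarity(word_profile(window, k), target)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy data only: a made-up "known" sequence embedded in a made-up background.
    known = "ACGTACGTGGCC"
    genome = "T" * 40 + known + "A" * 40
    for start, score in scan(genome, known, step=1)[:3]:
        print(start, round(score, 3))
```

Because the score depends only on word counts, no binding-site motif has to be defined in advance; words shared by a high-scoring window and the known sequence can be inspected afterwards.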

Conclusions/significance: Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz.
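One way to picture local sequence dissimilarity is sketched below. This is an illustration under stated assumptions, not the scoring used in the paper: it assumes a window is compared with its two immediately flanking windows of equal length, that word profiles are overlapping k-mer counts, and that dissimilarity is one minus the mean cosine similarity. The function names and the parameters `window` and `k` are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_profile(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(p, q):
    """Cosine similarity between two word-count profiles (an assumed measure)."""
    dot = sum(count * q[word] for word, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def local_dissimilarity(seq, start, window, k=4):
    """Score how unlike its neighbours the window seq[start:start+window] is.

    Returns 1 - mean similarity to the left and right flanking windows,
    or None when neither flank is long enough to hold a single word.
    """
    centre = word_profile(seq[start:start + window], k)
    flanks = [
        seq[max(0, start - window):start],            # upstream flank
        seq[start + window:start + 2 * window],       # downstream flank
    ]
    scores = [
        cosine_similarity(centre, word_profile(flank, k))
        for flank in flanks
        if len(flank) >= k
    ]
    if not scores:
        return None
    return 1.0 - sum(scores) / len(scores)
```

Under these assumptions, a window whose composition differs sharply from both neighbours scores near 1, while a window inside a compositionally uniform stretch scores near 0.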

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Amino Acid Motifs
Animals
Binding Sites
Cluster Analysis
Computational Biology / methods*
Conserved Sequence / genetics
Databases, Genetic
Drosophila / genetics
Genome
Genomics
Humans
Language*
Models, Genetic
Regulatory Sequences, Nucleic Acid*
Software
