A statistical method for alignment-free comparison of regulatory sequences

Miriam R Kantorovitz; Gene E Robinson; Saurabh Sinha

doi:10.1093/bioinformatics/btm211

Bioinformatics. 2007 Jul 1;23(13):i249-55. doi: 10.1093/bioinformatics/btm211.

Authors

Miriam R Kantorovitz¹, Gene E Robinson, Saurabh Sinha

Affiliation

¹ Department of Computer Science, University of Illinois, Urbana-Champaign, Illinois, USA.

PMID: 17646303
DOI: 10.1093/bioinformatics/btm211

Abstract

Motivation: The similarity of two biological sequences has traditionally been assessed within the well-established framework of alignment. Here we focus on the task of identifying functional relationships between cis-regulatory sequences that are non-orthologous or greatly diverged. 'Alignment-free' measures of sequence similarity are required in this regime.

Results: We investigate the use of a new score for alignment-free sequence comparison, called the D2z score. It is based on comparing the frequencies of all fixed-length words in the two sequences. An important, novel feature of the score is that it is comparable across sequence pairs drawn from arbitrary background distributions. We present a method that gives a quadratic improvement over the naïve method in the time complexity of calculating the score. We then evaluate the score on several tissue-specific families of cis-regulatory modules (in Drosophila and human). The new score is highly successful in discriminating functionally related regulatory sequences from unrelated sequence pairs. The performance of the score is compared to five other alignment-free similarity measures and shown to be consistently superior to all of them.
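As a rough illustration of what "comparing the frequencies of all fixed-length words" can look like, the Python sketch below counts shared words of length k between two sequences (the classical D2 word-match count, i.e. the inner product of the two k-mer count vectors) and then standardises that count against a null distribution. The abstract does not give the formula, so everything here is an assumption: the function names `kmer_counts`, `d2` and `empirical_z` are hypothetical, and the null distribution is estimated by shuffling each sequence, which preserves only single-base composition. It is not the authors' method for computing the mean and variance, nor their faster algorithm.

```python
from collections import Counter
import random
import statistics


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between the two sequences.

    Equivalent to the inner product of the two k-mer count vectors.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def empirical_z(seq_a, seq_b, k, n_shuffles=200, seed=0):
    """Standardise d2 against a shuffle-based null (an assumption,
    not the procedure used in the paper)."""
    rng = random.Random(seed)
    observed = d2(seq_a, seq_b, k)
    null_scores = []
    for _ in range(n_shuffles):
        a, b = list(seq_a), list(seq_b)
        rng.shuffle(a)
        rng.shuffle(b)
        null_scores.append(d2("".join(a), "".join(b), k))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    # A degenerate null (all shuffles identical) gives no usable scale.
    return (observed - mean) / sd if sd > 0 else 0.0
```

Standardising the raw match count is what would make scores comparable across sequence pairs with different base compositions; a shuffle-based null is simply the easiest way to show the idea, at the cost of many repeated D2 evaluations.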

Availability: Our implementation of the score will be made freely available as source code, upon publication of this article, at: http://veda.cs.uiuc.edu/d2z/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Computer Simulation
Data Interpretation, Statistical
Models, Genetic*
Models, Statistical
Regulatory Sequences, Nucleic Acid / genetics*
Sequence Alignment / methods
Sequence Analysis, DNA / methods*
Sequence Homology, Nucleic Acid