Heterogeneous molecular processes among the causes of how sequence similarity scores can fail to recapitulate phylogeny

Brief Bioinform. 2017 May 1;18(3):451-457. doi: 10.1093/bib/bbw034.

Abstract

Sequence similarity tools like Basic Local Alignment Search Tool (BLAST) are essential components of many functional genetic, genomic, phylogenetic and bioinformatic studies. Many modern analysis pipelines use significant sequence similarity scores (p- or E-values) and the ranked order of BLAST matches to test a wide range of hypotheses concerning homology, orthology, the timing of de novo gene birth/death and gene family expansion/contraction. Despite significant contrary findings, many of these tests still implicitly assume that stronger or higher-ranked E-value scores imply closer phylogenetic relationships between sequences. Here, we demonstrate that even though a general relationship does exist between the phylogenetic distance of two sequences and their E-value, significant and misleading errors occur in both the completeness and the order of results under realistic evolutionary scenarios. These results provide additional details to past evidence showing that studies should avoid drawing direct inferences of evolutionary relatedness from measures of sequence similarity alone, and should instead, where possible, use more rigorous phylogeny-based methods.
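As a rough illustration of the point above, the following sketch shows how ranking hits by E-value can disagree with ranking by phylogenetic distance, and how a close relative can be missed entirely. It is not taken from the study. It assumes the standard Karlin-Altschul form E = K·m·n·exp(−λS), uses λ = 0.267 and K = 0.041 as assumed typical values for gapped BLOSUM62 searches, and relies on entirely hypothetical hit names, raw scores, divergence ranks and sequence lengths.

```python
import math

def evalue(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul expected number of chance hits: E = K*m*n*exp(-lambda*S).

    lam and k are assumed, illustrative statistical parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical hits. "divergence" is an illustrative rank of phylogenetic
# distance from the query (1 = closest relative). The closest relative is
# given a low raw score, as could happen if its lineage evolved quickly.
hits = [
    {"name": "close_fast", "divergence": 1, "raw_score": 60},
    {"name": "mid", "divergence": 2, "raw_score": 100},
    {"name": "distant_slow", "divergence": 3, "raw_score": 120},
]

QUERY_LEN = 300          # assumed query length (residues)
DB_LEN = 10_000_000      # assumed database size (residues)
THRESHOLD = 1e-3         # assumed significance cutoff

for hit in hits:
    hit["evalue"] = evalue(hit["raw_score"], QUERY_LEN, DB_LEN)

# Order in which a similarity search would report the hits (best E-value first).
by_evalue = [h["name"] for h in sorted(hits, key=lambda h: h["evalue"])]
# Order implied by the (hypothetical) true phylogenetic distances.
by_divergence = [h["name"] for h in sorted(hits, key=lambda h: h["divergence"])]
# Hits that pass the cutoff and would therefore be reported at all.
significant = [h["name"] for h in hits if h["evalue"] <= THRESHOLD]

print("ranked by E-value:   ", by_evalue)
print("ranked by divergence:", by_divergence)
print("passing the cutoff:  ", significant)
```

In this toy setting, both kinds of error named in the abstract appear: the order of results is misleading, and the result set is incomplete because the closest relative falls below the cutoff.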

Keywords: BLAST; compositional bias; phylogenetics; phylostratigraphy; rate heterogeneity; sequence similarity.

MeSH terms

  • Computational Biology
  • Phylogeny*
  • Sequence Alignment
  • Software