Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies

J Math Biol. 2013 Oct;67(4):767-97. doi: 10.1007/s00285-012-0571-4. Epub 2012 Aug 9.

Abstract

Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction. Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.
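The concentration idea in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not say which site statistics are used, so everything below is an assumption chosen for simplicity. The sketch assumes a star tree with 500 leaves, a binary symmetric (CFN-style) substitution model, two hypothetical rate classes (0.2 and 2.0), and the fraction of disagreeing leaf pairs as the per-site statistic. With many leaves that statistic concentrates near a rate-dependent value, so splitting the sorted statistics at their largest gap separates slow sites from fast ones.

```python
import math
import random


def simulate_site(n_leaves, rate, branch_length, rng):
    """Simulate one binary site on a star tree (toy assumption).

    Each leaf differs from the root state independently with probability
    (1 - exp(-2 * rate * branch_length)) / 2, as in the symmetric
    two-state model.
    """
    p_flip = 0.5 * (1.0 - math.exp(-2.0 * rate * branch_length))
    root = rng.randint(0, 1)
    return [root ^ (rng.random() < p_flip) for _ in range(n_leaves)]


def pair_disagreement(site):
    """Fraction of unordered leaf pairs whose states differ at this site."""
    n = len(site)
    ones = sum(site)
    return ones * (n - ones) / (n * (n - 1) / 2)


def cluster_by_largest_gap(stats):
    """Split sites into two clusters at the largest gap in the sorted statistics.

    Returns a label per site: 0 for the low-statistic cluster, 1 for the high one.
    """
    order = sorted(range(len(stats)), key=stats.__getitem__)
    gaps = [stats[order[i + 1]] - stats[order[i]] for i in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    labels = [0] * len(stats)
    for position, site_index in enumerate(order):
        if position > cut:
            labels[site_index] = 1
    return labels


# Hypothetical setup: 200 slow sites and 200 fast sites on 500 leaves.
rng = random.Random(0)
rates = (0.2, 2.0)          # assumed rate classes, not taken from the paper
true_labels = [0] * 200 + [1] * 200
sites = [simulate_site(500, rates[label], 0.5, rng) for label in true_labels]

stats = [pair_disagreement(site) for site in sites]
labels = cluster_by_largest_gap(stats)

accuracy = sum(a == b for a, b in zip(labels, true_labels)) / len(true_labels)
print(f"clustering accuracy: {accuracy:.3f}")
```

On a star tree the leaves are independent given the root, so the concentration here is elementary. On a real phylogeny the leaves are correlated through shared ancestry, and that is the harder setting the paper addresses. Once sites are grouped by rate, each group can be passed to a standard reconstruction method, as the abstract describes.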

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Base Sequence / genetics
  • Cluster Analysis
  • Data Interpretation, Statistical*
  • Evolution, Molecular*
  • Models, Genetic*
  • Mutation*
  • Phylogeny*