Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies

J Math Biol. 2013 Oct;67(4):767-97. doi: 10.1007/s00285-012-0571-4. Epub 2012 Aug 9.


Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction. Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.
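The concentration idea behind the site-clustering step can be illustrated with a toy simulation. The sketch below is not the paper's algorithm. It assumes a symmetric two-state (CFN) substitution model, two hypothetical rate classes (0.5 and 2.0), and a per-site statistic equal to the fraction of leaf pairs that disagree, with pairs treated as independent and all at the same path length `t`. On a real phylogeny the pairs are correlated and path lengths vary. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def disagreement_prob(rate, t):
    """CFN model: probability that two leaves joined by a path of length t
    show different states at a site evolving with the given rate."""
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))


def simulate_site_statistic(rate, n_pairs, t, rng):
    """Fraction of n_pairs leaf pairs that disagree at one site.
    Assumption: pairs are simulated as independent, which a real tree
    does not guarantee."""
    p = disagreement_prob(rate, t)
    return sum(rng.random() < p for _ in range(n_pairs)) / n_pairs


def cluster_sites(stats, iters=20):
    """One-dimensional two-means clustering of per-site statistics.
    Returns 0 for the low-statistic cluster and 1 for the high one."""
    lo, hi = min(stats), max(stats)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        low = [s for s in stats if s <= mid]
        high = [s for s in stats if s > mid]
        if not low or not high:
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    mid = (lo + hi) / 2.0
    return [int(s > mid) for s in stats]


rng = random.Random(0)
slow_rate, fast_rate = 0.5, 2.0  # hypothetical rate classes
true_rates = [rng.choice([slow_rate, fast_rate]) for _ in range(200)]

# With many leaf pairs, each site's statistic concentrates near its
# expectation, which depends only on that site's rate.
stats = [simulate_site_statistic(r, n_pairs=2000, t=0.2, rng=rng)
         for r in true_rates]

labels = cluster_sites(stats)
accuracy = sum(label == (rate == fast_rate)
               for label, rate in zip(labels, true_rates)) / len(labels)
print(f"fraction of sites assigned to the correct rate class: {accuracy:.3f}")
```

Reducing `n_pairs` widens the spread of each site's statistic around its expectation and makes the two clusters overlap. This is the toy analogue of why the abstract's guarantees are stated for large trees.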

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Base Sequence / genetics
  • Cluster Analysis
  • Data Interpretation, Statistical*
  • Evolution, Molecular*
  • Models, Genetic*
  • Mutation*
  • Phylogeny*