On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled

Mol Biol Evol. 2004 Sep;21(9):1629-42. doi: 10.1093/molbev/msh159. Epub 2004 May 21.

Abstract

Using analytical methods, we show that under a variety of model misspecifications, neighbor-joining, minimum evolution, and least squares estimation procedures are statistically inconsistent. Failure to correctly account for differing rates-across-sites processes, failure to correctly model rate matrix parameters, and failure to adjust for parallel rates-across-sites changes (a rates-across-subtrees process) are all shown to lead to a "long branch attraction" form of inconsistency. In addition, failure to account for rates-across-sites processes is also shown to result in underestimation of evolutionary distances for a wide variety of substitution models, generalizing an earlier analytical result for the Jukes-Cantor model reported in Golding and a similar bias result for the GTR or REV model in Kelly and Rice (1996). Although standard rates-across-sites models can be employed in many of these cases to restore consistency, current models cannot account for other kinds of misspecification. We examine an idealized but biologically relevant case, where parallel changes in rates at sites across subtrees are shown to give rise to inconsistency. This changing rates-across-subtrees type of model misspecification cannot be adjusted for with conventional methods or without carefully considering the rate variation in the larger tree. The results are presented for four-taxon trees, but the expectation is that they have implications for larger trees as well. To illustrate this, a simulated 42-taxon example is given in which the microsporidia, an enigmatic group of eukaryotes, are incorrectly placed at the archaebacteria-eukaryotes split because of incorrectly specified pairwise distances. The analytical nature of the results lends insight into the reasons that long branch attraction tends to be a common form of inconsistency and the reasons that other forms of inconsistency like "long branches repel" can arise in some settings.
In many of the cases of inconsistency presented, a particular incorrect topology is estimated with probability converging to one, the implication being that measures of uncertainty like bootstrap support will be unable to detect that there is a problem with the estimation. The focus is on distance methods, but previous simulation results suggest that the zones of inconsistency for distance methods contain the zones of inconsistency for maximum likelihood methods as well.
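
The two effects described in the abstract (distance underestimation and long branch attraction) can be illustrated with a minimal numerical sketch. This is not the authors' derivation: it assumes the Jukes-Cantor model with gamma-distributed rates across sites, uses the standard closed-form expressions for that case, and works with expected (infinite-data) proportions of differing sites. The shape parameter, branch lengths, taxon labels, and function names are all hypothetical choices made for illustration. For four taxa, the split with the smallest sum of pairwise distances is used as a stand-in for the topology that neighbor-joining selects.

```python
import math

def expected_p_jc_gamma(d, alpha):
    """Expected proportion of differing sites at true distance d under
    Jukes-Cantor with gamma rates across sites (mean 1, shape alpha)."""
    return 0.75 * (1.0 - (1.0 + 4.0 * d / (3.0 * alpha)) ** (-alpha))

def jc_distance(p):
    """Jukes-Cantor distance that ignores rate variation (misspecified)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance corrected for gamma rates across sites."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

def best_split(dist):
    """Four-point rule: pick the split with the smallest distance sum."""
    sums = {
        "AB|CD": dist["AB"] + dist["CD"],
        "AC|BD": dist["AC"] + dist["BD"],
        "AD|BC": dist["AD"] + dist["BC"],
    }
    return min(sums, key=sums.get)

# Hypothetical setup: true tree ((A,B),(C,D)); A and C sit on long branches,
# B and D on short ones, joined by a short internal branch.
ALPHA = 0.5
LONG, SHORT, INTERNAL = 1.0, 0.1, 0.05

true_dist = {
    "AB": LONG + SHORT,
    "CD": LONG + SHORT,
    "AC": LONG + INTERNAL + LONG,
    "BD": SHORT + INTERNAL + SHORT,
    "AD": LONG + INTERNAL + SHORT,
    "BC": SHORT + INTERNAL + LONG,
}

# Distances an analyst would converge to when rate variation is ignored.
misspecified = {
    pair: jc_distance(expected_p_jc_gamma(d, ALPHA))
    for pair, d in true_dist.items()
}

# Distances recovered when the gamma model is correctly specified.
corrected = {
    pair: jc_gamma_distance(expected_p_jc_gamma(d, ALPHA), ALPHA)
    for pair, d in true_dist.items()
}

print("true split:        ", best_split(true_dist))
print("misspecified split:", best_split(misspecified))
print("corrected split:   ", best_split(corrected))
```

With these assumed values, every uncorrected distance falls below its true value, and the shrinkage is strongest for the longest path (A to C). The four-point rule applied to the uncorrected distances therefore groups the two long branches together, while the gamma-corrected distances recover the true split. Because the inputs are expected values rather than sampled data, the wrong grouping would not go away with more sites, which is the sense of inconsistency the abstract describes.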

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Archaea / genetics
  • Bacteria / genetics
  • Biometry / methods*
  • Eukaryotic Cells
  • Evolution, Molecular*
  • Genetic Variation
  • Least-Squares Analysis
  • Models, Genetic*
  • Phylogeny