Characterizing the phylogenetic tree-search problem

Syst Biol. 2012 Mar;61(2):228-39. doi: 10.1093/sysbio/syr097. Epub 2011 Nov 10.


Phylogenetic trees are important in many areas of biological research, ranging from systematic studies to the methods used for genome annotation. Finding the best scoring tree under any optimality criterion is an NP-hard problem, which necessitates the use of heuristics for tree-search. Although tree-search plays a major role in obtaining a tree estimate, there remains a limited understanding of its characteristics and how the elements of the statistical inferential procedure interact with the algorithms used. This study begins to answer some of these questions through a detailed examination of maximum likelihood tree-search on a wide range of real genome-scale data sets. We examine all 10,395 trees for each of the 106 genes of an eight-taxa yeast phylogenomic data set, then apply different tree-search algorithms to investigate their performance. We extend our findings by examining two larger genome-scale data sets and a large disparate data set that has been previously used to benchmark the performance of tree-search programs. We identify several broad trends occurring during tree-search that provide an insight into the performance of heuristics and may, in the future, aid their development. These trends include a tendency for the true maximum likelihood (best) tree to also be the shortest tree in terms of branch lengths, a weak tendency for tree-search to recover the best tree, and a tendency for tree-search to encounter fewer local optima in genes that have a high information content. When examining current heuristics for tree-search, we find that nearest-neighbor-interchange performs poorly, and frequently finds trees that are significantly different from the best tree. In contrast, subtree-pruning-and-regrafting tends to perform well, nearly always finding trees that are not significantly different to the best tree. Finally, we demonstrate that the precise implementation of a tree-search strategy, including when and where parameters are optimized, can change the character of tree-search, and that good strategies for tree-search may combine existing tree-search programs.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Classification / methods
  • Computational Biology
  • Genome
  • Likelihood Functions
  • Models, Biological
  • Phylogeny*
  • Yeasts / classification
  • Yeasts / genetics*