Syst Biol. 2015 Sep;64(5):778-91.
doi: 10.1093/sysbio/syv033. Epub 2015 Jun 1.

Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference

Ge Tan et al. Syst Biol. 2015 Sep.

Abstract

Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.

Keywords: alignment filtering; alignment trimming; molecular phylogeny; multiple sequence alignment; phylogenetic inference; phylogenetics; phylogeny.

Figures

Figure 1.
Schematic of the species tree discordance test used to evaluate filtering methods. The grey (green in online version) elements indicate extra steps involved in the enriched version of the test. The tests sample sets of orthologs with an undisputed phylogeny (black sequences). The enriched test adds homologous sequences with unknown branching order (green in online version). The input sequences are aligned and then filtered by the different filtering methods. The filtered alignments are evaluated by reconstructing trees from them, which are compared with the reference topology (red in online version). In the enriched test, all additional sequences are removed from the tree and what remains (the subtree relating the orthologous sequences) is compared to the reference topology. The unfiltered alignment is evaluated in the same way. All else being equal, the relative performance of the filtering methods can be assessed by their average congruence with the reference topology over a large number of input problems.
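The comparison step of the enriched test can be illustrated with a small sketch. It assumes trees are written as nested tuples of leaf names (a hypothetical representation chosen for this example, not the authors' pipeline), prunes the additional sequences from an inferred tree, and counts the bipartitions that differ from the reference topology (the Robinson-Foulds distance used as the error measure in Figure 2). The names `prune`, `splits`, and `rf_distance` are illustrative.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def leaves(tree):
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does not contain a fixed
    anchor leaf, so the same split always gets the same representation.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Reference topology over the orthologs A..E (assumed known).
reference = ((("A", "B"), "C"), ("D", "E"))
orthologs = {"A", "B", "C", "D", "E"}

# Two hypothetical inferred trees that also contain extra homologs X1..X3.
inferred_good = (((("A", "X1"), "B"), ("C", "X2")), ("D", ("E", "X3")))
inferred_bad = ((("A", "C"), ("B", "X1")), ("D", "E"))

dist_good = rf_distance(prune(inferred_good, orthologs), reference)
dist_bad = rf_distance(prune(inferred_bad, orthologs), reference)
```

Averaging such distances over many gene families, once for the unfiltered alignment and once per filtering method, gives the kind of comparison the test is designed for.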
Figure 2.
Alignment filtering generally yields poorer phylogenetic trees. Depicted here are results with the enriched species tree discordance test on amino acid (top) and nucleotide (bottom) alignments from three taxonomic ranges. The measure of error is the average RF distance between the reference trees and trees reconstructed from Prank + F alignments filtered by the various approaches. Trees were reconstructed using PhyML. Filtered alignments improving over unfiltered alignment fall in the gray region. The two dotted lines correspond to results obtained with two simplistic filtering methods (see main text). Points correspond to default parameters. Colored lines are linear interpolations between additional points obtained with non-default parameters (not available for all methods). Error bars indicate the standard error of the mean. If a filtering method with default parameters yields significantly different (two-sided Wilcoxon test, α=0.01) results from unfiltered alignments, a star is displayed below the corresponding point. Note that no multiple testing correction was applied.
Figure 3.
Filtering not only increases the fraction of branches that are unresolved, but also often increases the fraction of resolved branches that are incorrect. Using approximate Bayesian posterior as the branch support measure (Anisimova et al. 2011), we considered branches below particular branch support values as unresolved (cutoff values in italics) in the enriched species tree discordance test on amino acid sequences.
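The two quantities in this figure can be sketched as follows. This is a minimal illustration, assuming each inferred branch is summarised by its support value and a flag saying whether the branch also appears in the reference topology; the function name, the input format, and the example values are hypothetical, and the paper's exact bookkeeping may differ.

```python
def summarize_branches(branches, cutoff):
    """Return (fraction unresolved, fraction of resolved branches that are wrong).

    `branches` is a sequence of (support, in_reference) pairs. A branch whose
    support is below `cutoff` is treated as unresolved.
    """
    branches = list(branches)
    resolved = [in_reference for support, in_reference in branches if support >= cutoff]
    unresolved_fraction = (len(branches) - len(resolved)) / len(branches)
    # With no resolved branches there is nothing that can be wrong.
    wrong_fraction = resolved.count(False) / len(resolved) if resolved else 0.0
    return unresolved_fraction, wrong_fraction


# Made-up support values for five branches of one inferred tree.
example = [(0.99, True), (0.95, True), (0.80, False), (0.60, True), (0.97, False)]
unresolved, wrong = summarize_branches(example, cutoff=0.9)
```

In this toy input, two of five branches fall below the cutoff, and one of the three remaining branches is absent from the reference, which is the "well-supported but wrong" case the abstract warns about.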
Figure 4.
Reanalysis on Ensembl Compara confirms main findings. Points correspond to filtering methods under default parameters. Filtered alignments improving over unfiltered alignment fall in the gray region. The two dotted lines correspond to results obtained with two simplistic filtering methods (see main text). Colored lines are linear interpolations between additional points obtained by varying the parameters of the filtering methods (not available for TrimAl). If a filtering method with default parameters yields significantly different (two-sided Wilcoxon test, α=0.01) results from unfiltered alignments, a star is displayed below the corresponding point.
Figure 5.
Effect of alignment filtering on simulated data (500 alignments with 30 sequences each): induced tree and alignment accuracy. Tree accuracy (left): the measure of error is the average RF distance between the reference trees and trees reconstructed from Prank + F alignments filtered by the various approaches. Trees were reconstructed using PhyML. Filtered alignments improving over unfiltered alignment fall in the grey region. The two dotted lines correspond to results obtained with two simplistic filtering methods (see main text). Points correspond to default parameters. If a filtering method with default parameters yields significantly different (two-sided Wilcoxon test, α=0.01) results from unfiltered alignments, a star is displayed below the corresponding point. Error bars indicate the standard error of the mean. Alignment accuracy (right): precision and recall for the various filtering methods, using the sum-of-pairs scoring function (see section “Methods”).
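For readers unfamiliar with sum-of-pairs scoring, the sketch below shows one common way to compute precision and recall of aligned residue pairs against a true (simulated) alignment. It is an assumption-laden illustration rather than the definition given in the paper's Methods: alignments are lists of equal-length gapped strings, `-` is the gap character, filtering is modelled as a set of retained column indices, and the function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(rows, keep=None):
    """Collect residue pairs placed in the same column.

    Each residue is identified by (sequence index, position in the ungapped
    sequence), so pairs stay comparable after columns are removed. If `keep`
    is given, only pairs from those column indices are collected.
    """
    counters = [0] * len(rows)
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for seq_idx, row in enumerate(rows):
            if row[col] != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        if keep is None or col in keep:
            pairs.update(combinations(present, 2))
    return pairs


def sp_precision_recall(test_pairs, ref_pairs):
    """Precision and recall of test residue pairs relative to reference pairs."""
    shared = len(test_pairs & ref_pairs)
    precision = shared / len(test_pairs) if test_pairs else 0.0
    recall = shared / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall


# Toy true alignment and an inferred alignment that wrongly pairs the two G residues.
true_alignment = ["ACG-T", "AC-GT"]
inferred_alignment = ["ACGT", "ACGT"]

ref = aligned_pairs(true_alignment)
unfiltered = sp_precision_recall(aligned_pairs(inferred_alignment), ref)
filtered = sp_precision_recall(aligned_pairs(inferred_alignment, keep={0, 1}), ref)
```

In this toy case, keeping only the first two columns removes the wrong pair (precision rises from 0.75 to 1.0) but also discards a correct one (recall drops from 1.0 to 2/3), which is the trade-off that a precision and recall plot makes visible.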

References

    1. Altenhoff A.M., Schneider A., Gonnet G.H., Dessimoz C. 2011. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res. 39:D289–D294.
    2. Anisimova M., Gil M., Dufayard J.-F., Dessimoz C., Gascuel O. 2011. Survey of branch support methods demonstrates accuracy, power, and robustness of fast likelihood-based approximation schemes. Syst. Biol. 60:685–699.
    3. Bradley R.K., Roberts A., Smoot M., Juvekar S., Do J., Dewey C., Holmes I., Pachter L. 2009. Fast statistical alignment. PLOS Comput. Biol. 5:e1000392.
    4. Capella-Gutiérrez S., Silla-Martínez J.M., Gabaldón T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972–1973.
    5. Castresana J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.
