Genome Biol Evol. 2014 Nov 18;6(12):3199-209.
doi: 10.1093/gbe/evu252.

Indel Reliability in Indel-Based Phylogenetic Inference

Free PMC article

Haim Ashkenazy et al. Genome Biol Evol.

Abstract

It is often assumed that the same insertion or deletion (indel) event is unlikely to occur at the same position in two independent evolutionary lineages, and thus that indel-based inference of phylogeny should be less subject to homoplasy than standard inference based on substitution events. Indeed, indels have been used successfully to resolve debated evolutionary relationships among various taxonomic groups. However, indels are never observed directly; they are inferred from the alignment, so indel-based inference may be sensitive to alignment errors. It is hypothesized that phylogenetic reconstruction would be more accurate if it relied only on a subset of reliable indels instead of the entire indel data. Here, we developed a method to quantify the reliability of indel characters by measuring how often they appear in a set of alternative multiple sequence alignments (MSAs). Our approach assumes that indels that are consistently present in most alternative alignments are more reliable than indels that appear in only a small subset of these alignments. Using simulated and empirical data, we studied the impact of filtering and weighting indels by their reliability scores on the accuracy of indel-based phylogenetic reconstruction. The new method is available as a web server at http://guidance.tau.ac.il/RELINDEL/.
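The scoring idea in the abstract can be illustrated with a minimal sketch. This is not the RELINDEL implementation: the indel coding used here (each gap run identified by sequence name, number of residues preceding the gap, and gap length) is a simplifying assumption, and the function names and toy alignments are hypothetical. The score of an indel in the base alignment is the fraction of alternative alignments in which the same indel appears.

```python
from collections import Counter


def gap_runs(name, aligned_row):
    """Return (name, residues_before_gap, gap_length) for each gap run in one row."""
    runs = []
    residues_seen = 0
    i = 0
    while i < len(aligned_row):
        if aligned_row[i] == "-":
            j = i
            while j < len(aligned_row) and aligned_row[j] == "-":
                j += 1
            runs.append((name, residues_seen, j - i))
            i = j
        else:
            residues_seen += 1
            i += 1
    return runs


def indel_set(msa):
    """Collect all gap runs of an alignment given as {sequence name: aligned row}."""
    return {run for name, row in msa.items() for run in gap_runs(name, row)}


def reliability_scores(base_msa, alternative_msas):
    """Score each base-alignment indel by the fraction of alternatives containing it."""
    base_indels = indel_set(base_msa)
    counts = Counter()
    for alt in alternative_msas:
        alt_indels = indel_set(alt)
        for indel in base_indels:
            if indel in alt_indels:
                counts[indel] += 1
    n = len(alternative_msas)
    return {indel: counts[indel] / n for indel in base_indels}


def filter_reliable(scores, threshold):
    """Keep only indels whose score reaches the (user-chosen) threshold."""
    return {indel for indel, score in scores.items() if score >= threshold}


# Toy data (hypothetical): three sequences, one base MSA, three alternative MSAs.
base = {"a": "AC--GT", "b": "ACTTGT", "c": "A-CTGT"}
alts = [
    {"a": "AC--GT", "b": "ACTTGT", "c": "AC-TGT"},
    {"a": "A--CGT", "b": "ACTTGT", "c": "A-CTGT"},
    {"a": "AC--GT", "b": "ACTTGT", "c": "ACT-GT"},
]
scores = reliability_scores(base, alts)
reliable = filter_reliable(scores, threshold=0.5)
```

In this toy case the two-residue gap in sequence "a" recurs in two of the three alternatives, while the single-residue gap in "c" recurs in only one, so only the former passes a threshold of 0.5. The same scores could instead be used as character weights rather than as a hard filter.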

Keywords: alignment reliability; indel analysis; multiple sequence alignment; phylogeny.

Figures

Fig. 1. The agreement regarding indel characters derived from three common MSA algorithms: MAFFT, PRANK, and CLUSTALW (A) using all indels and (B) using the most reliable indel characters identified by RELINDEL.
Fig. 2. Phylogenetic trees reconstructed using all indel characters coded from MSAs produced by (A) PRANK, (B) MAFFT, and (C) CLUSTALW. When using indels derived from the PRANK MSAs, the obtained tree differed significantly from the accepted primate tree. The red branch shows the misplacement of Gorilla in the PRANK-based inference. Additional statistical information is provided in panel (D) (Informative, number of informative characters; CI, consistency index; RI, retention index).
Fig. 3. MSAs and corresponding indel character matrices for the first 40 amino acids of the human AGPS gene (ENSG00000018510) as inferred by (A) PRANK, (B) MAFFT, and (C) CLUSTALW. Homoplasious indels, which conflict with the accepted primate tree, are boxed in yellow. The three alignment methods disagree strongly on the placement of these indels. RELINDEL identifies these indels as highly unreliable (see text).
Fig. 4. Distribution of the indel-reliability scores for (A) PRANK, (B) MAFFT, and (C) CLUSTALW as a function of indel length.
Fig. 5. Phylogenetic trees reconstructed using the most reliable indel characters coded from MSAs produced by (A) PRANK, (B) MAFFT, and (C) CLUSTALW and filtered by the RELINDEL method. The correct primate phylogeny was reconstructed when using indels derived from both PRANK and MAFFT. Homo is misplaced in the tree reconstructed based on CLUSTALW MSAs (the erroneous branch is marked in red). Additional statistical information is provided in panel (D) (Informative, number of informative characters; CI, consistency index; RI, retention index).
Fig. 6. ROC curves quantifying the ability of RELINDEL to accurately detect reliable indels based on simulated data. The AUC is given in parentheses next to each alignment algorithm. ROC curves for simulations with (A) symmetric tree and (B) asymmetric tree.
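
For readers unfamiliar with the ROC/AUC evaluation mentioned in the Fig. 6 caption, the following is a minimal sketch of how an ROC curve and its AUC can be computed from reliability scores. It assumes, hypothetically, that each inferred indel carries a binary label saying whether it also occurs in the true simulated alignment; the labels, scores, and function names are illustrative and do not come from the paper.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; an indel is called reliable if score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical reliability scores and "indel is in the true alignment" labels.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [True, True, False, True]
example_auc = auc(roc_points(example_scores, example_labels))
```

An AUC of 1.0 would mean that every true indel scores higher than every false one, whereas 0.5 corresponds to scores that carry no information about correctness.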


