Syst Biol. 2019 Jan 1;68(1):117-130.
doi: 10.1093/sysbio/syy036.

Multiple Sequence Alignment Averaging Improves Phylogeny Reconstruction

Free PMC article

Haim Ashkenazy et al. Syst Biol.


The classic methodology for inferring a phylogenetic tree from sequence data consists of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet inferred MSAs have been shown to be inaccurate, and alignment errors reduce the accuracy of tree inference. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach in which, instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach yields, on average, more accurate trees than 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether the extra effort of computing a SuperMSA and inferring a tree from it is worthwhile. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at
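The concatenation step described above can be illustrated with a minimal Python sketch. It assumes each alternative MSA is held as a dictionary from sequence name to gapped string; the function name `concatenate_msas` and the toy alignments are hypothetical and are not taken from the authors' software.

```python
def concatenate_msas(msas):
    """Join alignments of the same sequences column-wise into one SuperMSA.

    Each alignment is a dict mapping sequence name -> aligned (gapped) string.
    This is an illustrative helper, not the authors' implementation.
    """
    if not msas:
        raise ValueError("need at least one alignment")
    names = sorted(msas[0])
    for msa in msas:
        if sorted(msa) != names:
            raise ValueError("all alignments must contain the same sequence names")
        # Every row of a single alignment must have the same number of columns.
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError("rows within one alignment must have equal length")
    # Append the columns of each alternative MSA after those of the previous one.
    return {name: "".join(msa[name] for msa in msas) for name in names}


# Toy data: two alternative alignments of the same three sequences.
base = {"s1": "AC-GT", "s2": "ACGGT", "s3": "A--GT"}
alt = {"s1": "ACG-T", "s2": "ACGGT", "s3": "A-G-T"}
super_msa = concatenate_msas([base, alt])
```

Here the SuperMSA has ten columns, five from each input alignment, so a tree-inference program run on it sees the columns of both alternatives at once.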


Figure 1.
The difference in average CS between alternative MSAs and the base MSA when using MAFFT (a–c) and PRANK (d–f) as the alignment method, for the PAM250 (a, d), PAM100 (b, e), and ENSEMBLsim (c, f) data sets. A positive value indicates that an alternative MSA is more accurate than the base MSA.
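A column score of the kind compared in Figure 1 can be sketched as follows. The sketch assumes the common definition in which the score is the fraction of reference columns reproduced exactly in the test alignment; the abstract does not spell out the precise variant used, and the names `columns` and `column_score` are hypothetical.

```python
def columns(msa, names):
    """Describe each column by the residue index of every sequence (None for a gap)."""
    counters = {name: 0 for name in names}
    result = []
    for i in range(len(msa[names[0]])):
        column = []
        for name in names:
            if msa[name][i] == "-":
                column.append(None)
            else:
                column.append(counters[name])
                counters[name] += 1
        result.append(tuple(column))
    return result


def column_score(test, reference):
    """Fraction of reference columns that appear identically in the test MSA."""
    names = sorted(reference)
    reference_columns = columns(reference, names)
    test_columns = set(columns(test, names))
    return sum(c in test_columns for c in reference_columns) / len(reference_columns)


# Toy data: treat one alignment as the reference and score the other against it.
reference = {"s1": "AC-GT", "s2": "ACGGT", "s3": "A--GT"}
test = {"s1": "ACG-T", "s2": "ACGGT", "s3": "A-G-T"}
score = column_score(test, reference)
```

The quantity plotted in the figure would then be the score of an alternative MSA minus the score of the base MSA, both measured against the true (simulated) alignment.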
Figure 2.
Inaccurate alternative MSAs do not point at a single wrong topology. For each of the 249 MAFFT-PAM250 cases, the SuperMSA’s alternative MSAs are divided into two sets: bMSAs (more accurate than the base MSA) and wMSAs (less accurate than the base MSA). We inferred a phylogenetic tree based on each single MSA in the bMSA (wMSA) group. Each point represents a single case, for which two median normRF distances were computed: one for the bMSA (x-axis) and one for the wMSA (y-axis). Trees inferred based on bMSAs are generally closer to the true tree than trees inferred based on wMSAs (a). The level of agreement among trees of each set is measured by the median normRF distance of all pairs in the group, with a lower median distance indicating a higher level of agreement within the set. Sets of bMSA trees have higher agreement than sets of wMSA trees (b).
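The two medians in this caption can be sketched in Python. The sketch assumes that normRF is the Robinson-Foulds distance divided by its maximum for unrooted binary trees, 2(n - 3) for n leaves; the caption does not state the normalization, and the nested-tuple tree encoding and function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def leaf_set(tree):
    """Collect the leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions, each stored as the side holding a fixed leaf."""
    leaves = leaf_set(tree)
    reference_leaf = min(leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(below) < len(leaves) - 1:
            found.add(below if reference_leaf in below else leaves - below)
        return below

    visit(tree)
    return found


def norm_rf(tree_a, tree_b):
    """Robinson-Foulds distance scaled by 2(n - 3), an assumed normalization."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b) or len(leaves) < 4:
        raise ValueError("trees must share the same set of at least four leaves")
    return len(splits(tree_a) ^ splits(tree_b)) / (2 * (len(leaves) - 3))


# Toy data: a "true" tree and two inferred trees on five leaves.
true_tree = (("A", "B"), ("C", "D"), "E")
inferred = [(("A", "C"), ("B", "D"), "E"), (("A", "B"), ("C", "E"), "D")]

# Median distance of the inferred trees to the true tree (as in panel a).
median_to_true = median(norm_rf(true_tree, t) for t in inferred)
# Median distance over all pairs within the set (as in panel b).
median_within = median(norm_rf(a, b) for a, b in combinations(inferred, 2))
```

Storing each split as the side that contains one fixed leaf makes the comparison independent of where the nested-tuple encoding happens to place the root.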
Figure 3.
The fraction of alternative MSAs that are more accurate than the base MSA as a function of the base MSA accuracy (measured by the SPC score). Each data set is plotted separately: MAFFT-PAM250 (a), MAFFT-PAM100 (b), MAFFT-ENSEMBLsim (c), PRANK-PAM250 (d), PRANK-PAM100 (e), and PRANK-ENSEMBLsim (f). The linear regression line, R-squared, and P-value are indicated for each data set.
Figure 4.
The fraction of improved trees when considering alternative MSAs as a function of the base MSA accuracy scored by GUIDANCE2. Base MSAs were divided into bins according to their GUIDANCE2 SPC score, where lower scores represent less accurate MSAs. The number of improved trees and the total number of trees are indicated inside the bar for each bin. Each data set is plotted separately: MAFFT-PAM250 (a), MAFFT-PAM100 (b), MAFFT-ENSEMBLsim (c), PRANK-PAM250 (d), PRANK-PAM100 (e), and PRANK-ENSEMBLsim (f).
Figure 5.
The accuracy of the inferred tree and the running time as a function of the number of alternative MSAs in the SuperMSA, calculated for the MAFFT-PAM250 data set.
