Algorithms Mol Biol. 2008 Jun 24;3:7.
doi: 10.1186/1748-7188-3-7.

Noisy: identification of problematic columns in multiple sequence alignments


Andreas W M Dress et al. Algorithms Mol Biol.

Abstract

Motivation: Sequence-based methods for phylogenetic reconstruction from (nucleic acid) sequence data are notoriously plagued by two effects: homoplasies and alignment errors. Large evolutionary distances imply a large number of homoplastic sites. Because most protein-coding genes show dramatic variations in substitution rates that are correlated across the sequence, this often leads to a patchwork pattern of (i) phylogenetically informative and (ii) effectively randomized regions. Furthermore, alignment errors accumulate in highly variable regions, sometimes producing misleading signals in phylogenetic reconstruction.

Results: We present here a method that, based on assessing the distribution of character states along a cyclic ordering of the taxa, allows the identification of phylogenetically uninformative homoplastic sites in a multiple sequence alignment. Removal of these sites appears to improve the performance of phylogenetic reconstruction algorithms as measured by various indices of "tree quality". In particular, we obtain more stable trees due to the exclusion of phylogenetically incompatible sites that most likely represent strongly randomized characters.

Software: The computer program noisy implements this approach. It can be used to improve phylogenetic reconstruction with a considerable success rate whenever (1) the average bootstrap support obtained from the original alignment is low and (2) the data set contains sufficiently many taxa, say at least 12 to 15. The software can be obtained under the GNU Public License from http://www.bioinf.uni-leipzig.de/Software/noisy/.
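
The idea in the Results paragraph, looking at how character states are distributed along a cyclic ordering of the taxa, can be illustrated with a small sketch. The abstract does not define the reliability score used by noisy, so the code below is not that score. It is a simple proxy, assumed here for illustration only, that counts how often the state changes between neighbouring taxa when walking once around the cycle. The function names and the toy alignment are hypothetical.

```python
from typing import List, Sequence


def columns_of(alignment: Sequence[str]) -> List[str]:
    """Return the alignment columns, one string per site (rows are taxa)."""
    return ["".join(col) for col in zip(*alignment)]


def cyclic_state_changes(column: Sequence[str], order: Sequence[int]) -> int:
    """Count neighbouring taxa along the cyclic order that differ in state.

    Illustrative proxy only; NOT the q score computed by noisy.
    The pair (last taxon, first taxon) is included to close the cycle.
    """
    n = len(order)
    return sum(
        column[order[i]] != column[order[(i + 1) % n]]
        for i in range(n)
    )


# Toy alignment: four taxa (rows), three sites (columns).
alignment = [
    "AAC",
    "AAG",
    "ATC",
    "GTG",
]
order = [0, 1, 2, 3]  # hypothetical cyclic ordering of the taxa

changes = [cyclic_state_changes(col, order) for col in columns_of(alignment)]
print(changes)
```

In this toy case the first two sites form contiguous blocks of states along the cycle and change only twice, while the third site alternates at every step, which is the kind of pattern one would expect from a randomized character.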

Figures

Figure 1
Number of cyclic orderings of a set of 10 complete mitochondrial genomes with a prescribed fraction of "noisy" characters (i.e., q(C, χ) ≤ 0.8). The cyclic orderings computed by NeighborNet or QNet indeed essentially minimize the fraction of putative randomized alignment sites. At least in this example, QNet with quartet-mapping-derived quartet weights performs best. "ClustalW" refers to the circular ordering implicitly constructed by ClustalW from its guide tree, which determines the order in which sequences and profiles are combined to yield the final alignment.
Figure 2
Distribution of homoplastic sites for the mitochondrial atp6 gene of Squamata (2047 positions, above) and for 18S RNA of Coleoptera from an analysis of [37] (684 positions, below). In terms of quality, the two data sets are very different. While the majority of sites in atp6 are parsimony informative and approximately one third of the sites have a reliability score above qcutoff = 0.8, this is clearly not the case for the data set of [37], where most of the sites are constant or unreliable. The black bar below the alignment indicates whether the q-value of the corresponding position is above (upper half) or below (lower half) the cutoff value. Note that only green positions have a chance of having a q-value above the cutoff value.
Figure 3
MP trees of spatangoid sea urchins from combined 28S rRNA, 16S rRNA, and mitochondrial COI sequences [25]. Left: tree from the original data. Right: tree from a reduced alignment with cutoff q = 0.8. The latter tree matches the biological expectation and fits very well with those reported in [25], which were obtained from a manually reduced alignment. In particular, the noisy-reduced MP tree correctly shows Brissopsis and Allobrissus as sister groups, and it correctly identifies the large monophyletic clade consisting of the Linopneustes/Metalia and Lovenia/Spatangus groups to the exclusion of Meoma and Archeopneustes. These major improvements are marked with a bullet. The included table compares the stability indices (HI = homoplasy index, RC = rescaled consistency index, RI = retention index) between the complete (unprocessed) alignment, Stockley's manually improved alignment, and the noisy-reduced alignment.
Figure 4
Dependence of tree-quality indices on the cutoff value qcutoff for the protein-coding mitochondrial genes from all 31 currently available Squamata. The stability of the trees is measured by the scaled log likelihood (ln L)/n, the homoplasy index (HI) [27], and the rescaled consistency index (RC) [28] as computed by PAUP 4.0b10 [26]. Data sets are alignments (supplied in the electronic supplement) of individual mitochondrial protein-coding genes. They vary in size (from about 170 to 1800 nt) and in degree of randomization.
Figure 5
The relative average bootstrap support of phylogenetic trees is computed as the ratio of the average bootstrap support for the modified alignments to the bootstrap support obtained from the original alignment. Values larger than 1 indicate an increase in tree robustness. The curves show a distinct maximum that depends on the number of taxa and the topology of the tree. The maximum improvement increases with the number of taxa (indicated on the right margin of both panels for the highlighted curves). For clarity, error bars obtained from 100 replicates are shown only for N = 10 and N = 25 taxa. The tree topologies, caterpillar trees on the left and balanced trees on the right, are depicted by the insets.
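
The ratio described in the Figure 5 caption can be sketched in a few lines. This is a minimal illustration, assuming that the support for the original alignment is averaged over branches in the same way as for the modified alignment, which the caption does not state explicitly. The function name and the support values are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Sequence


def relative_bootstrap_support(
    modified: Sequence[float], original: Sequence[float]
) -> float:
    """Ratio of mean bootstrap support (modified alignment / original alignment).

    Assumption: both inputs hold per-branch bootstrap values for one tree each.
    A result above 1 means the modified alignment gave higher average support.
    """
    original_mean = mean(original)
    if original_mean == 0:
        raise ValueError("original alignment has zero average bootstrap support")
    return mean(modified) / original_mean


# Hypothetical per-branch bootstrap values (percent), for illustration only.
original_support = [50, 60, 70]
modified_support = [60, 75, 90]

ratio = relative_bootstrap_support(modified_support, original_support)
print(ratio)
```

With these made-up numbers the averages are 75 and 60, so the ratio is above 1, which under the caption's reading corresponds to an increase in tree robustness.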

References

    1. Ogden TH, Rosenberg M. Multiple Sequence Alignment Accuracy and Phylogenetic Inference. Syst Biol. 2006;55:314–328. doi: 10.1080/10635150500541730.
    2. Landan G, Graur D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol. 2007;24:1380–1383. doi: 10.1093/molbev/msm060.
    3. Björklund M. Are Third Positions Really That Bad? A Test Using Vertebrate Cytochrome b. Cladistics. 1999;15:91–97.
    4. Yang Z. On the best evolutionary rate for phylogenetic analysis. Syst Biol. 1998;47:125–133. doi: 10.1080/106351598261067.
    5. Wägele JW. Foundations of Phylogenetic Systematics. Munich, Germany: Verlag Dr Friedrich Pfeil; 2005.