ISME J. 2010 Jan;4(1):17-27.
doi: 10.1038/ismej.2009.97. Epub 2009 Aug 27.

Fast UniFrac: Facilitating High-Throughput Phylogenetic Analyses of Microbial Communities Including Analysis of Pyrosequencing and PhyloChip Data

Free PMC article

Micah Hamady et al. ISME J. 2010.

Abstract

Next-generation sequencing techniques and the PhyloChip have made simultaneous phylogenetic analyses of hundreds of microbial communities possible. Insight into community structure has been limited by the inability to integrate and visualize such vast datasets. Fast UniFrac overcomes these issues, allowing larger numbers of sequences and samples to be integrated into a single analysis. Its new array-based implementation offers orders-of-magnitude improvements in speed over the original version. New 3D visualization of principal coordinates analysis results, with the option to view multiple coordinate axes simultaneously, provides a powerful way to quickly identify patterns that relate vast numbers of microbial communities. We show the potential of Fast UniFrac using examples from three data types: Sanger-sequencing studies of diverse free-living and animal-associated bacterial assemblages and of the gut of obese humans as they diet, pyrosequencing data integrated from studies of the human hand and gut, and PhyloChip data from a study of citrus pathogens. We show that a Fast UniFrac analysis using a reference tree recaptures patterns that could not be detected without considering phylogenetic relationships, and that Fast UniFrac, coupled with BLAST-based sequence assignment, can be used to quickly analyze pyrosequencing runs containing hundreds of thousands of sequences, showing patterns relating human hand and gut samples. Finally, we show that the application of Fast UniFrac to PhyloChip data could identify well-defined subcategories associated with infection. Together, these case studies point the way toward a broad range of applications and show some of the new features of Fast UniFrac.

Figures

Figure 1
Difference in procedure between the original UniFrac and the new Fast UniFrac (for clarity, only the unweighted UniFrac algorithm is shown here, but similar principles apply to weighted UniFrac). In the original procedure, (A) environments are stored as sets in a tree object, (B) the tree is pruned to include only the branches leading to wanted environments, (C) the sets of environments are compared using set algorithms, states are assigned to each internal node, and (D) the result is calculated by another tree traversal. In the new procedure, (E) the environments are stored as an array of tip × environment counts, (F) selected environments are chosen by slicing this array, (G) internal states are calculated using array operations on slices of the array, and (H) the products of the incidence array and the branch lengths of nodes leading to either or both of the environments are summed, allowing calculation of the UniFrac value. The array-based approach allows substantial gains in efficiency.
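The array-based steps (E) to (H) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy tree, the `BRANCHES` and `COUNTS` names, and the use of plain Python lists in place of a numerical array library are all assumptions made for illustration. Each branch is described by its length and the tips below it, and the unweighted UniFrac value is taken as the branch length leading to exactly one of the two environments divided by the branch length leading to either or both.

```python
# Hypothetical toy tree: each entry is (branch_length, indices of the tips below that branch).
BRANCHES = [
    (1.0, [0]),      # branch leading to tip A
    (1.0, [1]),      # branch leading to tip B
    (2.0, [2]),      # branch leading to tip C
    (0.5, [0, 1]),   # internal branch above tips A and B
]

# (E) tip x environment count array: rows are tips, columns are environments.
COUNTS = [
    [3, 0],  # tip A observed only in environment 0
    [0, 2],  # tip B observed only in environment 1
    [1, 1],  # tip C observed in both environments
]


def unweighted_unifrac(branches, counts, env_i, env_j):
    """Unweighted UniFrac between two environment columns of a count array."""
    # (F) select the two environments by slicing out their columns as presence flags.
    present_i = [row[env_i] > 0 for row in counts]
    present_j = [row[env_j] > 0 for row in counts]

    unique_length = 0.0    # branch length leading to exactly one environment
    observed_length = 0.0  # branch length leading to either or both environments
    for length, tips in branches:
        # (G) a branch carries an environment if any tip below it does.
        in_i = any(present_i[t] for t in tips)
        in_j = any(present_j[t] for t in tips)
        # (H) sum branch lengths weighted by the incidence flags.
        if in_i or in_j:
            observed_length += length
        if in_i != in_j:
            unique_length += length
    return unique_length / observed_length if observed_length else 0.0


distance = unweighted_unifrac(BRANCHES, COUNTS, 0, 1)
```

For this toy input, only the branches to tips A and B are unique to one environment, so the value is 2.0 / 4.5. With a numerical array library, the per-branch loop would typically be replaced by whole-array operations, which is where the efficiency gain described in the caption comes from.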
Figure 2
Global Environmental Survey dataset (Ley et al., 2008b) analyzed using PCoA of unweighted pairwise UniFrac distances, with trees generated using (1) megablast mapping to the Greengenes core set tree (A), (2) an ARB parsimony insertion tree (B), and (3) megablast mapping to the Greengenes core set represented as a star phylogeny (i.e., a phylogeny in which all taxa are treated as equally related, ignoring the actual phylogenetic information) (C). All plots show the first three principal axes as visualized in the 3D viewer. Scatterplots of the pairwise UniFrac distances (D, E), as well as the PCoA analysis, show that megablast mapping to the Greengenes core set produced results similar to those from ARB parsimony insertion, but only when the phylogenetic relationships in the Greengenes core set are considered.
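To see why a star phylogeny discards phylogenetic information, consider a small sketch. It assumes, for illustration only, that every taxon hangs directly off the root on a branch of equal length; the function name `star_unifrac` and the taxon labels are hypothetical. Under that assumption, unweighted UniFrac reduces to the fraction of observed taxa that are not shared, that is, the Jaccard distance between the two taxon sets.

```python
def star_unifrac(taxa_a, taxa_b, branch_length=1.0):
    """Unweighted UniFrac on a star tree whose branches all have the same length."""
    a, b = set(taxa_a), set(taxa_b)
    # Every observed taxon contributes one branch to the total.
    observed_length = len(a | b) * branch_length
    # Only taxa found in exactly one sample contribute unique branch length.
    unique_length = len(a ^ b) * branch_length
    return unique_length / observed_length if observed_length else 0.0


# Hypothetical taxon labels: two of four observed taxa are shared.
star_distance = star_unifrac({"t1", "t2", "t3"}, {"t2", "t3", "t4"})
```

Here `star_distance` is 0.5 regardless of how closely related `t1` and `t4` are, whereas on a real tree two closely related unshared taxa would add little unique branch length.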
Figure 3
Principal coordinates analysis of weighted UniFrac values between hand (blue) and gut (red) pyrosequencing datasets, with the axes scaled by the percentage of the variance that they explain (A) or unscaled (B, C). Panel B plots PC1 vs. PC2 and panel C plots PC1 vs. PC3. A parallel coordinates plot (D) shows along which of the first 10 PC axes the hand and gut samples differ: in this display, the position of each sample along each of the first 10 axes is plotted (for example, the hand samples score high on PC1 and the gut samples score low, so on the first line, for PC1, the hand samples have high values and the gut samples have low values). A scree plot (E) shows the percentage of the variance explained by each of the first 10 PC axes, both individually (red) and cumulatively (blue).
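The two curves of a scree plot can be computed from the PCoA eigenvalues as in the following sketch. The eigenvalues are made-up illustrative numbers, not values from the study, and the function name `scree_percentages` is hypothetical. The sketch also assumes all eigenvalues are non-negative; PCoA on non-Euclidean distances can yield negative eigenvalues, and how to treat them is a separate choice not covered here.

```python
def scree_percentages(eigenvalues):
    """Return (individual, cumulative) percentages of variance for each axis."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    # Percentage of variance explained by each axis on its own.
    individual = [100.0 * ev / total for ev in eigenvalues]
    # Running total of the individual percentages.
    cumulative = []
    running = 0.0
    for pct in individual:
        running += pct
        cumulative.append(running)
    return individual, cumulative


# Hypothetical eigenvalues, sorted from largest to smallest.
EIGENVALUES = [5.0, 3.0, 1.0, 1.0]
individual, cumulative = scree_percentages(EIGENVALUES)
```

Scaling each plotted axis by its entry in `individual` corresponds to the scaled view in panel A, while the unscaled panels give every axis the same visual extent.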
Figure 4
Performance of Fast UniFrac versus the original implementation on sample sizes ranging from 1,000 to 10,000 sequences. The Fast UniFrac implementation is consistently about two orders of magnitude faster and largely eliminates the difference in the time needed to calculate the weighted and unweighted UniFrac metrics.
Figure 5
Example PhyloChip analysis performed using PhyloTrac and Fast UniFrac. (A) Exporting the environment file from PhyloTrac, (B) uploading to Fast UniFrac, (C) viewing weighted Fast UniFrac PCoA results directly in the web interface (in this display, each point is a sample, and we see a 2D projection of the first two principal coordinates obtained by PCoA; the relatively smooth curve suggests that there is a gradient connecting the samples), (D) viewing unweighted Fast UniFrac ordination results in the linked 3D viewer (again, each point is a sample and the coordinates are obtained by PCoA of the UniFrac distances, but in this case three dimensions are shown), and (E) a scree plot showing how much of the variation is explained singly or cumulatively by each of the first 10 principal coordinates, allowing the user to see that, for example, the first three principal coordinates together explain over 80% of the variance in the samples. As reported in the original study, no clear patterns are readily seen using ordination, but the example demonstrates the speed and ease with which this sort of analysis can now be performed.

References

    1. Alexander E, Stock A, Breiner HW, Behnke A, Bunge J, Yakimov MM, et al. Microbial eukaryotes in the hypersaline anoxic L'Atalante deep-sea basin. Environ Microbiol. 2009;11:360–81.
    2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
    3. Balakirev ES, Pavlyuchkov VA, Ayala FJ. DNA variation and symbiotic associations in phenotypically diverse sea urchin Strongylocentrotus intermedius. Proc Natl Acad Sci U S A. 2008;105:16218–23.
    4. Bryant JA, Lamanna C, Morlon H, Kerkhoff AJ, Enquist BJ, Green JL. Colloquium paper: microbes on mountainsides: contrasting elevational patterns of bacterial and plant diversity. Proc Natl Acad Sci U S A. 2008;105(Suppl 1):11505–11.
    5. DeSantis TZ, Brodie EL, Moberg JP, Zubieta IX, Piceno YM, Andersen GL. High-density universal 16S rRNA microarray analysis reveals broader diversity than typical clone library when sampling the environment. Microb Ecol. 2007;53:371–83.