Genetics. 2014 Jul;197(3):925-37.
doi: 10.1534/genetics.114.161299. Epub 2014 May 1.

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data


John D O'Brien et al. Genetics. 2014 Jul.

Abstract

Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of high rates of migration among sample sites. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.

Keywords: Bayesian phylogenetics; metagenomics; microevolution.
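The abstract's core idea, that differences in lineage frequency among samples carry information about the lineage haplotypes, can be illustrated with a small mixture sketch. This is not the authors' implementation: it assumes biallelic SNPs, haplotypes coded as 0/1, a single symmetric sequencing error rate, and a binomial read-count likelihood, and the names `expected_alt_freq` and `binomial_log_likelihood` are hypothetical. The tree prior and MCMC sampling described in the paper are omitted.

```python
import math


def expected_alt_freq(proportions, haplotypes, error_rate=0.01):
    """Expected alternate-allele read frequency at each SNP in one pool.

    proportions: lineage frequencies in the pool (assumed to sum to 1).
    haplotypes:  one 0/1 list per lineage, all of equal length.
    error_rate:  assumed symmetric per-read error; must lie strictly
                 between 0 and 1 so the log-likelihood stays finite.
    """
    n_snps = len(haplotypes[0])
    freqs = []
    for s in range(n_snps):
        # True alternate-allele frequency is the proportion-weighted
        # sum of lineage alleles at this SNP.
        p = sum(w * h[s] for w, h in zip(proportions, haplotypes))
        # A read shows the alternate allele if it is truly alternate and
        # read correctly, or truly reference and read in error.
        freqs.append(p * (1 - error_rate) + (1 - p) * error_rate)
    return freqs


def binomial_log_likelihood(alt_counts, depths, freqs):
    """Binomial log-likelihood of alternate-read counts given frequencies."""
    total = 0.0
    for k, n, f in zip(alt_counts, depths, freqs):
        # Log binomial coefficient via log-gamma.
        total += math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        total += k * math.log(f) + (n - k) * math.log(1 - f)
    return total


# Two hypothetical lineages over three SNPs, mixed 75% / 25% in one pool.
haplotypes = [[0, 1, 1], [1, 1, 0]]
freqs = expected_alt_freq([0.75, 0.25], haplotypes)

# Made-up read counts at depth 100 per SNP, for illustration only.
alt_counts, depths = [25, 99, 75], [100, 100, 100]
ll_true = binomial_log_likelihood(alt_counts, depths, freqs)
ll_swapped = binomial_log_likelihood(
    alt_counts, depths, expected_alt_freq([0.25, 0.75], haplotypes)
)
```

Under these assumptions, the counts are better explained by the 75% / 25% mixture than by the swapped proportions, which is the kind of signal a sampler over haplotypes and proportions could exploit across several pools.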


Figures

Figure 1
A diagram of the lineage model. On the left, a coalescent process leads to a complete genealogy, with the tips marked by pool as colors. The right diagrams the lineage model approximation, showing deep branching events together with cones shading the SNP variation indistinguishable from noise.
Figure 2
Comparison between simulated tree and pool proportions (left) and inferred trees and pool proportions (right). In the inferred model, the dark blue tree shows the maximum posterior probability tree with the light blue trees representing samples from the MCMC. Pie charts on the right show inferred pool proportions, with shades indicating the posterior percentile from dark (5%) to light (95%).
Figure 3
(Left) Percentage of SNP similarity between simulated and inferred lineages. (Right) Comparison between pool proportions for simulated (light gray) and inferred (dark gray) values for each simulated lineage. Blue circles show combined proportions for simulated lineages 2 and 3.
Figure 4
Comparison of simulated and inferred values for lineages (left column) and pool proportions (right column) by number of SNPs (A and B), number of reads (C and D), and error rate (E and F). Insets give mixture value (“Mix”), number of reads (“Reads”), number of SNPs (“SNPs”), and error rate (“Err”).
Figure 5
(Top) Location on simulated tree of SNPs for six sequence patterns. The branch width is proportional to number of SNPs. (Bottom) Inferred model presentation is the same as in Figure 2.
Figure 6
Inferred lineage model for Chlorobium data from Ace Lake and open ocean samples.
Figure 7
Inferred lineage model for Plasmodium falciparum apicoplast data from 20 clinical samples from northern Ghana.


