mSystems. 2018 Apr 17;3(3):e00021-18.
doi: 10.1128/mSystems.00021-18. eCollection 2018 May-Jun.

Phylogenetic Placement of Exact Amplicon Sequences Improves Associations with Clinical Information

Stefan Janssen et al. mSystems. 2018.

Abstract

Recent algorithmic advances in amplicon-based microbiome studies enable the inference of exact amplicon sequence fragments. These new methods enable the investigation of sub-operational taxonomic units (sOTU) by removing erroneous sequences. However, short (e.g., 150-nucleotide [nt]) DNA sequence fragments do not contain sufficient phylogenetic signal to reproduce a reasonable tree, introducing a barrier in the utilization of critical phylogenetically aware metrics such as Faith's PD or UniFrac. Although fragment insertion methods do exist, those methods have not been tested for sOTUs from high-throughput amplicon studies in insertions against a broad reference phylogeny. We benchmarked the SATé-enabled phylogenetic placement (SEPP) technique explicitly against 16S V4 sequence fragments and showed that it outperforms the conceptually problematic but often-used practice of reconstructing de novo phylogenies. In addition, we provide a BSD-licensed QIIME2 plugin (https://github.com/biocore/q2-fragment-insertion) for SEPP and integration into the microbial study management platform QIITA.

IMPORTANCE The move from OTU-based to sOTU-based analysis, while providing additional resolution, also introduces computational challenges. We demonstrate that one popular method of dealing with sOTUs (building a de novo tree from the short sequences) can provide incorrect results in human gut metagenomic studies and show that phylogenetic placement of the new sequences with SEPP resolves this problem while also yielding other benefits over existing methods.

Keywords: SEPP; amplicon sequencing; microbial community analysis; phylogenetic placement.

Figures

FIG 1
Comparing read recruitment, de novo, and insertion tree strategies for phylogenetic diversity computation. (A) Ideally, all short amplicon fragments (red) would have known full-length 16S sequences (black), which in turn would allow reconstruction of a phylogenetic tree. (B) In real-world experiments, only a minority of fragments have corresponding full-length 16S references. (C) The “read recruitment” strategy, also known as closed-reference OTU picking, assigns fragments to tips of a well-curated reference phylogeny, e.g., Greengenes, with a given sequence similarity threshold. Fragments of clades not covered in the reference are rejected. (D) In order to keep all fragments, the de novo strategy reconstructs the whole phylogeny based on the short fragments that do not carry as much evolutionary signal as full-length 16S sequences and thus often results in topologically very different trees. (E) The insertion tree strategy takes advantage of a well-curated phylogeny and extends it with fragments obtained by experiment. Only highly unrelated fragments are rejected, while the overall topology of the resulting phylogenetic trees remains stable.
FIG 2
SEPP avoids artificially long outgroup branches that would lead to exaggerated separation in beta diversity data. (A) Principal-coordinate analysis (PCoA) of unweighted UniFrac distances based on a de novo phylogeny. Three low-abundance Methanobrevibacter sOTUs, not detectable in the lower gray cluster and of very low abundance in the upper colored cluster, drove a spurious separation of 599 stool samples obtained from participants of the MrOS Study. (B) Manually shortening the grandparent’s branch length from 0.82 to 0.4 in the de novo phylogeny reunited spurious clusters. (C) Inserting de novo fragments into a well-curated reference phylogeny via SEPP also resolved cluster separation but did not require any manual manipulation.
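
The effect described for Fig. 2 can be illustrated with a minimal sketch of unweighted UniFrac on a toy tree. This is not the authors' code: the tree, its node names, and all branch lengths except the 0.82 and 0.4 values quoted in the caption are invented for illustration. Unweighted UniFrac is taken here as the summed length of branches observed in only one of two samples divided by the summed length of branches observed in either.

```python
# Toy rooted tree: child -> (parent, length of the branch above the child).
# Node names and most lengths are hypothetical; only 0.82 and 0.4 come from
# the figure caption (the grandparent branch of the rare tip "M").
def make_tree(grandparent_length):
    return {
        "A": ("n1", 0.1),
        "B": ("n1", 0.1),
        "n1": ("root", 0.2),
        "M": ("n2", 0.05),                     # rare, outgroup-like tip
        "n2": ("n3", 0.05),                    # parent of M
        "n3": ("root", grandparent_length),    # grandparent of M
    }


def covered_branches(tree, tips):
    """Return every branch (named by its child node) on a tip-to-root path."""
    covered = set()
    for tip in tips:
        node = tip
        while node in tree:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tree, tips_a, tips_b):
    """Unique branch length divided by total observed branch length."""
    a = covered_branches(tree, tips_a)
    b = covered_branches(tree, tips_b)
    unique = sum(tree[n][1] for n in a ^ b)
    total = sum(tree[n][1] for n in a | b)
    return unique / total


sample_without = {"A", "B"}        # rare tip not detected
sample_with = {"A", "B", "M"}      # rare tip detected at any abundance

long_branch = unweighted_unifrac(make_tree(0.82), sample_without, sample_with)
short_branch = unweighted_unifrac(make_tree(0.4), sample_without, sample_with)
```

In this toy setting `long_branch` exceeds `short_branch`: because the metric ignores abundance, the mere presence of one rare tip behind a long branch is enough to inflate the distance between otherwise identical samples.
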
FIG 3
Higher sub-OTU resolution, in combination with SEPP phylogenies, exposed relevant ecological signals. (A to D) For the Malawi children, the same 7,554,708 reads from 179 samples (150 nt; mean number of reads per sample, 42,205) were processed by “closed-reference” OTU picking (A), “open-reference” OTU picking against the same reference database (B), and the sub-OTU method “Deblur” (C), and correlations between the resulting unweighted UniFrac beta diversities were computed via Mantel tests (D). (E to G) For the Alaska birds, a total of 5,932,450 reads from 137 samples (125 nt; mean number of reads per sample, 43,303) were processed with the same three methods. Pairwise testing between sample groups was performed via PERMANOVA with 9,999 permutations. Statistically significant differences between groups are indicated via bold orange edges, while nonsignificant edges are colored gray. Green boxes at the right side of panels A, B, and C summarize pairwise beta diversity distances within the group of “good” samples, and the dark blue boxes represent distances within “poor” samples. The cyan-colored boxes show between-group distances, i.e., all pairwise distances between “good” and “poor” samples. Similarly, the green, dark blue, and cyan boxes in panels E (closed-reference OTU picking), F (open-reference OTU picking), and G (Deblur) summarize pairwise distances within “adult” samples, within “hatch year” samples, and between the two groups, respectively. Correlations between the unweighted UniFrac beta diversities were again computed via Mantel tests (H).
FIG 4
Deviations between de novo or insertion trees and gold standard trees. For 100 iterations, we randomly chose 10,000 150-nt V4 fragments to split the Greengenes tree into training and testing trees. Phylogenies for the 10,000 fragments were constructed via QIIME2’s de novo recommendations and SEPP. For both metrics, the insertion trees were significantly (two-sided Mann-Whitney tests; P < 10^-32) closer to the gold standard than the de novo trees. The tip-to-tip distance summarizes the similarity of two trees as the Pearson correlation coefficient of two sets of path lengths, where pairs with tips not present in both trees are omitted. Those two sets are independently enumerated as pairwise tip-to-tip path lengths for each tree.
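
The tip-to-tip comparison described for Fig. 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark code: trees are stored as hypothetical `child -> (parent, branch length)` dictionaries with nonnegative lengths, both toy trees are invented, and the Pearson coefficient is computed by hand so that only the standard library is needed.

```python
from itertools import combinations
from math import sqrt


def ancestor_distances(tree, node):
    """Map the node and each of its ancestors to the distance from the node."""
    distances = {node: 0.0}
    dist = 0.0
    while node in tree:
        parent, length = tree[node]
        dist += length
        distances[parent] = dist
        node = parent
    return distances


def path_length(tree, a, b):
    """Path length between two tips through their lowest common ancestor."""
    up_a = ancestor_distances(tree, a)
    up_b = ancestor_distances(tree, b)
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def tips(tree):
    """Nodes that never appear as a parent."""
    parents = {parent for parent, _ in tree.values()}
    return {node for node in tree if node not in parents}


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def tip_to_tip_correlation(tree1, tree2):
    """Correlate pairwise path lengths over tips present in both trees."""
    shared = sorted(tips(tree1) & tips(tree2))
    pairs = list(combinations(shared, 2))
    d1 = [path_length(tree1, a, b) for a, b in pairs]
    d2 = [path_length(tree2, a, b) for a, b in pairs]
    return pearson(d1, d2)


# Hypothetical "gold standard" tree and a tree with a different topology.
reference = {"A": ("n1", 0.1), "B": ("n1", 0.1), "n1": ("root", 0.3),
             "C": ("n2", 0.1), "D": ("n2", 0.1), "n2": ("root", 0.3)}
shuffled = {"A": ("m1", 0.1), "C": ("m1", 0.1), "m1": ("root", 0.3),
            "B": ("m2", 0.1), "D": ("m2", 0.1), "m2": ("root", 0.3),
            "E": ("root", 0.5)}   # E is absent from the reference, so it is omitted

same = tip_to_tip_correlation(reference, reference)
different = tip_to_tip_correlation(reference, shuffled)
```

A tree compared with itself gives a coefficient of 1 (up to floating-point error), while the reshuffled topology gives a lower value. A real analysis would use a phylogenetics library rather than this quadratic-time toy.
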
FIG 5
Perfectly matching fragments are precisely inserted below the species level. We extracted all possible (n = 208,255) unique V4 150-nt fragments from Greengenes reference alignments and reinserted those into the Greengenes 99% sequence identity reference phylogenetic tree, which is based on 1,261,500 full-length ribosomal sequences. Due to trimming, many full-length sequences map to the same fragment. (A) Taxonomic diversity by rank to establish reference coordinates. (B) Insertion error for V4 fragments as the path length from the inserted position in the tree to the lowest common ancestor (lca) of all true OTU tips. x-axis data denote ambiguity, i.e., the number of originating OTUs for a fragment; note the binning for more than 7 true OTUs. Blue bars indicate fragments that map only to representative sequences, while green bars show results for fragments that also map to the majority of nonrepresentative sequences. (C) A histogram for fragment distribution by ambiguity and representativeness.
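
The insertion error used in Fig. 5B, the path length from the inserted position to the lowest common ancestor (lca) of all true OTU tips, can be sketched on a toy tree. The tree below and its node names are hypothetical, and the sketch simplifies the inserted position to an existing node of the tree rather than a point along a branch.

```python
def ancestor_distances(tree, node):
    """Map the node and each of its ancestors to the distance from the node."""
    distances = {node: 0.0}
    dist = 0.0
    while node in tree:
        parent, length = tree[node]
        dist += length
        distances[parent] = dist
        node = parent
    return distances


def lowest_common_ancestor(tree, tips):
    """Deepest node shared by the root paths of all given tips."""
    paths = [ancestor_distances(tree, tip) for tip in tips]
    common = set(paths[0]).intersection(*paths[1:])
    # The shared ancestor closest to the first tip is the lca
    # (assumes strictly positive branch lengths, so there are no ties).
    return min(common, key=lambda node: paths[0][node])


def insertion_error(tree, inserted_at, true_tips):
    """Path length from the insertion node to the lca of the true OTU tips."""
    target = lowest_common_ancestor(tree, sorted(true_tips))
    up_i = ancestor_distances(tree, inserted_at)
    up_t = ancestor_distances(tree, target)
    return min(up_i[n] + up_t[n] for n in up_i if n in up_t)


# Hypothetical reference tree: child -> (parent, branch length).
tree = {"A": ("n1", 0.1), "B": ("n1", 0.1), "n1": ("root", 0.3),
        "C": ("n2", 0.1), "n2": ("root", 0.2)}

exact = insertion_error(tree, "n1", {"A", "B"})      # placed at the lca itself
near = insertion_error(tree, "A", {"A", "B"})        # placed one branch below
far = insertion_error(tree, "n2", {"A", "B"})        # placed in the wrong clade
```

For an unambiguous fragment (a single true OTU), the lca is that tip itself, so the error reduces to the plain path length between the insertion node and the tip.
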
FIG 6
Insertion errors are not equally distributed across the reference phylogeny. y-axis data show the mean insertion distance for unambiguous 150-nt V4 fragments grouped by phylum of the true OTUs. Numbers of taxa within phyla are indicated as numbers following phylum names.
FIG 7
Insertion distance grows linearly with the number of point mutations. (A) Taxonomic diversity reference data were determined as described for Fig. 5. (B) Insertion errors as the path length from insertion to single true OTU node for fragments with up to 10 point mutations.
FIG 8
Comparison of insertion errors made by SEPP and SortMeRNA. The reference alignment and tree were randomly split into 10% testing and 90% training sequences. V4 fragments (150 nt) were generated from the test sequences and reinserted via SEPP or aligned via SortMeRNA. (A) Taxonomic diversity by rank to establish reference coordinates. (B) Insertion errors for SEPP and SortMeRNA between the true and assigned positions in the tree. (C) A histogram for fragment distribution by method. Note that SortMeRNA rejected more fragments than SEPP.
FIG 9
Meta-analyses of two microbiome studies with heterogeneous variable 16S regions. (A) De novo tree construction resulted in strong artifacts in the PCoA space (see black arrow). (B) Insertion of heterogeneous sOTUs into the same backbone tree via SEPP resolved the artifact and enabled meaningful insights. (C) Available V2 reads from the “Yanomami” samples served as a positive control. Separation of samples from the two studies was indeed driven by body product and not by different sequencing parameters.
FIG 10
Empirical speedup of SEPP in HPC environments. For a data set with 50,000 fragments, SEPP is used with various numbers of cores on one node of the Comet supercomputing cluster to place fragments into the 99% Greengenes reference tree. The running time starts at 8 h with one thread and continues to decrease as the number of threads increases. The unit line is shown as a dotted red line.
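
As a reminder of how such a speedup curve is usually read, the sketch below computes speedup and parallel efficiency from wall-clock times. Only the 8 h single-thread time comes from the caption; the multi-core timings are hypothetical placeholders, not measurements from the paper.

```python
def speedup(serial_hours, parallel_hours):
    """Speedup relative to the single-thread run."""
    return serial_hours / parallel_hours


def efficiency(serial_hours, parallel_hours, cores):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup(serial_hours, parallel_hours) / cores


SERIAL_HOURS = 8.0  # from the caption: running time with one thread

# HYPOTHETICAL timings (cores -> hours), invented purely for illustration.
hypothetical_runs = {2: 4.4, 4: 2.5, 8: 1.6}

for cores, hours in sorted(hypothetical_runs.items()):
    print(cores, round(speedup(SERIAL_HOURS, hours), 2),
          round(efficiency(SERIAL_HOURS, hours, cores), 2))
```

A run that followed ideal linear scaling would have a speedup equal to its core count and an efficiency of 1.
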
FIG 11
Fragment generation. We degapped a 150-nt V4 region of the PyNAST alignment (from column 2,263 through 3,794), trimmed sequences that were too long, and discarded sequences that were too short. Dereplication resulted in 208,255 (green) unique 150-nt V4 fragments.
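
The steps in this caption (slice the alignment columns, degap, trim, discard, dereplicate) can be sketched as below. This is an illustrative reading, not the authors' pipeline: the function name, the gap characters, the choice to trim by keeping the first `length` bases, and the toy alignment are all assumptions, and whether the column numbers in the caption are 0- or 1-based is not stated in the text.

```python
def extract_fragments(aligned, start, end, length=150):
    """Return {fragment: [sequence ids]} for unique fragments of one region.

    aligned : dict of sequence id -> aligned sequence (gaps as '-' or '.')
    start, end : Python-style slice bounds of the alignment columns
    length : required fragment length after degapping
    """
    unique = {}
    for seq_id, row in aligned.items():
        # Slice the region of interest, then remove gap characters.
        region = row[start:end].replace("-", "").replace(".", "")
        if len(region) < length:
            continue                      # too short: discard
        fragment = region[:length]        # too long: trim (assumed 5' anchored)
        # Dereplicate: identical fragments share one dictionary entry.
        unique.setdefault(fragment, []).append(seq_id)
    return unique


# Tiny made-up alignment; a fragment length of 6 stands in for 150 nt.
toy_alignment = {
    "s1": "AC-GT-ACGT",
    "s2": "ACG-T-ACGT",
    "s3": "A---C-----",
}
fragments = extract_fragments(toy_alignment, 0, 10, length=6)
```

Here `s1` and `s2` collapse into a single fragment and `s3` is dropped for being too short, mirroring the caption's remark for Fig. 5 that many full-length sequences map to the same fragment after trimming.
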
FIG 12
Tree constructions for random reinsertions.
