Genome Biol. 2017 Sep 21;18(1):181.
doi: 10.1186/s13059-017-1309-9.

DESMAN: a new tool for de novo extraction of strains from metagenomes

Christopher Quince et al. Genome Biol. 2017.

Abstract

We introduce DESMAN for De novo Extraction of Strains from Metagenomes. Large multi-sample metagenomes are being generated but strain variation results in fragmentary co-assemblies. Current algorithms can bin contigs into metagenome-assembled genomes but are unable to resolve strain-level variation. DESMAN identifies variants in core genes and uses co-occurrence across samples to link variants into haplotypes and abundance profiles. These are then searched for against non-core genes to determine the accessory genome of each strain. We validated DESMAN on a complex 50-species 210-genome 96-sample synthetic mock data set and then applied it to the Tara Oceans microbiome.
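
The co-occurrence idea in the abstract can be illustrated with a toy calculation. The sketch below is not DESMAN's model (which, per the Fig. 2 caption, is fitted with a Gibbs sampler); it only shows why variants carried by the same strain share a frequency profile across samples. The function name, haplotypes and abundances are made up for illustration.

```python
# Toy illustration only: haplotypes, abundances and names are hypothetical.

def expected_variant_frequencies(haplotypes, abundances):
    """Expected variant-allele frequency for each position in each sample.

    haplotypes: list of G lists with 0/1 per position (1 = variant allele).
    abundances: list of S lists with G relative abundances that sum to 1.
    Returns a list with one per-sample frequency profile per position.
    """
    n_strains = len(haplotypes)
    n_positions = len(haplotypes[0])
    return [
        [sum(sample[g] * haplotypes[g][p] for g in range(n_strains))
         for sample in abundances]
        for p in range(n_positions)
    ]

haplotypes = [
    [0, 0, 0, 0],  # strain A: reference allele everywhere
    [1, 1, 0, 0],  # strain B: variants at positions 0 and 1
    [0, 0, 1, 1],  # strain C: variants at positions 2 and 3
]
abundances = [      # one row per sample, one column per strain
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]

freqs = expected_variant_frequencies(haplotypes, abundances)
for position, profile in enumerate(freqs):
    print(position, profile)
```

Positions 0 and 1 follow strain B's abundance across the three samples, while positions 2 and 3 follow strain C's. With real, noisy read counts, this shared variation across samples is the signal that lets variants be grouped into haplotypes.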

Keywords: Metagenomes; Niche; Strain.

Conflict of interest statement

Ethics approval and consent to participate

No ethical approval was necessary for this study.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Summary of the DESMAN pipeline. A full description of the statistics and bioinformatics underlying DESMAN is given in ‘Methods’. The software itself is available open source from https://github.com/chrisquince/DESMAN. COG cluster of orthologous groups of proteins, SCSG single-copy core species gene, TNF tetranucleotide frequencies
Fig. 2
a Posterior mean deviance for different strain numbers, G, for the synthetic strain mock Escherichia coli SCSG positions. We ran five replicates of the Gibbs sampler at each value of G on 1,000 random positions from the 6,044 variants identified. b SNV accuracy as a function of sample number. The number of incorrectly inferred SNVs is averaged across all five strains and 20 replicates of a random subset of the 64 samples. c Comparison of true E. coli strain frequency vs. DESMAN predictions. We compare the known E. coli strain frequencies as relative coverage against the frequencies in each sample of the DESMAN-predicted haplotype it mapped onto (R²=0.9998, p-value <2.2×10⁻¹⁶). d Comparison of gene presence inferred for the haplotypes and the known assignment of genes to strain genomes. Gene presence/absence was inferred for the haplotypes using Eq. 8 and compared to known references. Overall accuracy was 95.7%. These results were for the run with G=5, which had the lowest posterior mean deviance. E. coli Escherichia coli, SNP single-nucleotide polymorphism, SNV single-nucleotide variant
Fig. 3
Validation of reconstructed strains for the Escherichia coli O104:H4 outbreak. a The mean SNV uncertainty, i.e. the proportion of SNVs that a strain differs from its closest match in a replicate run, averaged over all the other replicates. This is shown on the y-axis against mean relative abundance across samples on the x-axis. b Phylogenetic tree constructed for the eight inferred strains found for the E. coli O104:H4 outbreak. The SCSGs for the strains and reference genomes were aligned separately using mafft [50], trimmed and then concatenated together. The tree was constructed using FastTree [51]. Inferred strains are shown as magenta, O104:H4 strains in red and uropathogenic E. coli in blue. Both results were for the run with G=8 that had the lowest posterior mean deviance. SNP single-nucleotide polymorphism, SNV single-nucleotide variant
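
The SNV uncertainty described in panel a can be sketched as follows. This is a minimal reading of the caption's definition, not code from DESMAN itself; the function names and the short example sequences are hypothetical.

```python
# Hypothetical helper names; follows the caption's wording, not DESMAN's code.

def snv_difference(hap_a, hap_b):
    """Proportion of positions at which two equal-length haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same positions")
    return sum(a != b for a, b in zip(hap_a, hap_b)) / len(hap_a)

def mean_snv_uncertainty(strain, other_replicates):
    """Distance from `strain` to its closest match in each other replicate,
    averaged over those replicates."""
    closest = [
        min(snv_difference(strain, candidate) for candidate in replicate)
        for replicate in other_replicates
    ]
    return sum(closest) / len(closest)

strain = "ACGTACGT"
other_replicates = [
    ["ACGTACGT", "TTTTACGT"],  # closest match is identical (difference 0)
    ["ACGTACGA", "CCGTACGA"],  # closest match differs at 1 of 8 positions
]
uncertainty = mean_snv_uncertainty(strain, other_replicates)
print(uncertainty)  # (0 + 1/8) / 2 = 0.0625
```

A strain that is reconstructed almost identically in every replicate run gets a value near zero, which is the sense in which low uncertainty indicates a confidently inferred haplotype.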
Fig. 4
a Variant detection for the 75 CONCOCT clusters of the complex strain mock that were 75% pure and complete. Here, 36 clusters (shown) had variants, and 27 of these mapped onto multi-strain species, enabling us to calculate the number of variants that were present in the species (true positives or TPs), the number detected that were not in the species (false positives or FPs) and the number we failed to detect (false negatives or FNs). b Haplotype inference accuracy. For the 25 CONCOCT clusters that were 75% complete, possessed variants and mapped onto species with strain variation, we plot the true number of strains (x-axis) against the inferred number (y-axis), with random jitter to distinguish data points. The colour reflects the mean error rate in SNV predictions on single-copy core genes (Err) and the size reflects the total coverage of the cluster (see Additional file 1: Table S8 for actual values). c Comparison of the true relative strain frequency and inferred haplotype frequency across the 96 samples for the complex strain mock. The data points are coloured by the SNV error rate (E) in the haplotype prediction. (Linear regression of true vs. predicted frequency, all haplotypes: slope = 0.820, adjusted R-squared = 0.741, p-value <2.2×10⁻¹⁶; haplotypes with E<0.01: slope = 0.853, adjusted R-squared = 0.810, p-value <2.2×10⁻¹⁶.) d Haplotype SNV error vs. gene presence/absence inference error rate. For each of the 67 inferred haplotypes, we give the SNV error rate on single-copy core genes to the closest reference strain against the error rate in the prediction of gene presence/absence in that strain. Cov coverage, Err error, FN false negative, FP false positive, SNP single-nucleotide polymorphism, SNV single-nucleotide variant, TP true positive
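
The TP, FP and FN counts in panel a can be tallied from two sets of variant positions, as in the sketch below. The position values and function names are invented for illustration, and the precision/recall summary is a common convention assumed here, not something the caption reports.

```python
# Hypothetical example: positions and helper names are not from the paper.

def variant_detection_counts(true_positions, called_positions):
    """Return (TP, FP, FN) for called variants against the known variants."""
    true_set, called_set = set(true_positions), set(called_positions)
    tp = len(true_set & called_set)   # present in the species and detected
    fp = len(called_set - true_set)   # detected but not in the species
    fn = len(true_set - called_set)   # in the species but not detected
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    """Summarise the counts; returns 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

known = [10, 25, 40, 77]    # variant positions known from reference genomes
called = [10, 25, 40, 90]   # variant positions reported for a cluster
tp, fp, fn = variant_detection_counts(known, called)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, precision, recall)  # 3 1 1 0.75 0.75
```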
Fig. 5
Top panel: Number of haplotypes inferred by DESMAN as a function of MAG genome length. A significant negative correlation was observed (Spearman’s test, ρ=−0.569, p-value = 0.000068). Bottom panel: SCG nucleotide divergence against genome divergence for the Tara haplotypes separated by MAG length. This gives the fractional divergence in SNVs between every pair of haplotypes (I) against the fractional divergence in 5% gene clusters across the whole genome (C). Data points are divided according to whether they derived from a MAG with genome length <1 Mbp. In a linear regression of genome divergence against nucleotide divergence, whether a MAG was <1 Mbp was a significant interaction (slope = 0.11±0.02, p-value = 5.95×10⁻⁹; slope interaction small = TRUE, 0.33±0.07, p-value = 3.51×10⁻⁶; overall adjusted R-squared = 0.6021, p-value = 1.786×10⁻¹²). MAG metagenome-assembled genome, SCG single-copy core gene, SNV single-nucleotide variant
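
For readers unfamiliar with the statistic in the top panel, Spearman's ρ is the Pearson correlation of the ranks of the two variables. The sketch below computes it from scratch on made-up numbers; the values are not the Tara data, and the function names are hypothetical.

```python
# Toy data only: genome lengths and haplotype counts below are invented.

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

genome_length_mbp = [0.8, 1.2, 2.5, 3.1, 4.0]
haplotype_count = [5, 4, 4, 2, 1]
rho = spearman_rho(genome_length_mbp, haplotype_count)
print(round(rho, 3))
```

A negative ρ, as in this toy example and in the caption, means that longer genomes tend to have fewer inferred haplotypes. The p-value reported in the caption requires a significance test that this sketch does not implement.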
Fig. 6
Geographic distribution of TARA_MED_MAG_00110 haplotypes. Top panel: Box plot of each haplotype’s relative abundance across the 11 regions where more than one sample had coverage greater than one. Bottom panel: The top left subpanel gives the total normalised relative abundance of the entire MAG. The other three subpanels give relative haplotype abundance for the three confidently inferred variants within this MAG. Results are shown for the 33 of 61 surface samples for which this MAG had coverage greater than 1. All three haplotypes were significantly associated with geographic region based on Kruskal–Wallis ANOVAs (H2: χ²=20.9, p-value = 0.0074; H3: χ²=17.8, p-value = 0.023; H4: χ²=23.1, p-value = 0.0032). MAG metagenome-assembled genome, MED Mediterranean, ASW Atlantic South-West, ION Indian Ocean North, PSE Pacific South-East, PSW Pacific South-West, IOS Indian Ocean South, PON Pacific Ocean North, RED Red Sea

References

    1. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9:811–4. doi: 10.1038/nmeth.2066.
    2. Scholz M, Ward DV, Pasolli E, Tolio T, Zolfo M, Asnicar F, et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat Methods. 2016;13(5):435–8. doi: 10.1038/nmeth.3802.
    3. Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature. 2015;523:208–11. doi: 10.1038/nature14486.
    4. Pevzner P, Tang H, Waterman M. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001;98:9748–53. doi: 10.1073/pnas.171285098.
    5. Kelley DR, Salzberg SL. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinforma. 2010;11:544. doi: 10.1186/1471-2105-11-544.
