Nat Biotechnol. 2015 Oct;33(10):1053-60.
doi: 10.1038/nbt.3329. Epub 2015 Sep 14.

Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning

Brian Cleary et al. Nat Biotechnol. 2015 Oct.

Abstract

Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
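
The idea of eigengenomes as covariance in k-mer abundance across samples can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes a streaming calculation, whereas the code below runs an ordinary in-memory SVD with NumPy. The function names, the CRC32 hash, the toy reads, and the parameters `k`, `n_cols`, and `rank` are all assumptions made for illustration.

```python
import zlib
import numpy as np


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_abundance_matrix(samples, k, n_cols):
    """Count hashed k-mers: one row per sample, one column per hash bucket."""
    matrix = np.zeros((len(samples), n_cols))
    for row, reads in enumerate(samples):
        for read in reads:
            for kmer in kmers(read, k):
                # CRC32 is a stand-in hash chosen for determinism (assumption).
                col = zlib.crc32(kmer.encode()) % n_cols
                matrix[row, col] += 1
    return matrix


def eigengenomes(matrix, rank):
    """Return k-mer loadings on the top `rank` singular vectors, shape (n_cols, rank)."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:rank].T * s[:rank]


# Toy data (hypothetical): two "genomes" mixed in different proportions per sample.
samples = [
    ["ACGTACGT", "ACGTACGT", "TTGGCCAA"],
    ["ACGTACGT", "TTGGCCAA", "TTGGCCAA"],
    ["TTGGCCAA"],
]
matrix = kmer_abundance_matrix(samples, k=4, n_cols=64)
loadings = eigengenomes(matrix, rank=2)
print(matrix.shape, loadings.shape)
```

K-mers from the same genome rise and fall together across samples, so their columns receive similar loadings. That shared signal is what a later clustering step could use to group them.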

Figures

Figure 1
Accuracy and completeness of recovered genomes. The accuracy of Salmonella enriched partitions (rows) with respect to each strain (columns) is depicted on a color scale. Saturation of each color indicates the completeness of each assembly with respect to each strain. Bars in the two right panels indicate the fraction of reads in a partition coming from any Salmonella strain (red line = 5%; the background abundance of spiked-in Salmonella reads), and the total assembly length. The tree at the top was constructed using MUMi distance between strains.
Figure 2
Salmonella enterica multiple genome alignment. Multiple sequence alignment (MSA) blocks (gray ring) are ordered by their conservation across 1–7 strains. The inner rings depict portions of each genome that align to each MSA block. Within 5 Salmonella enterica-enriched partitions, the read depth at each MSA block is shown as a heatmap in the outer rings. Partition numbers, from the center outward, are 1424, 56, 86, 1369, and 1093.
Figure 3
Latent Strain Analysis pipeline. Metagenomic samples containing multiple species (depicted by different colors) are sequenced. Every k-mer in every sequencing read is hashed to one column of a matrix. Values from each sample occupy a different row. Singular value decomposition of this k-mer abundance matrix defines a set of eigengenomes. K-mers are clustered across eigengenomes, and each read is partitioned based on the intersection of its k-mers with each of these clusters. Each partition contains a small fraction of the original data and can be analyzed independently of all others.
Figure 4
Enrichment of bacterial families spanning six orders of magnitude in abundance. Each circle represents one family in the FijiCoMP-stool collection. The x-axis is the background (unpartitioned) abundance of each family, as determined by species-specific 16S ribosomal DNA. Y-axis values are the maximum relative abundance in any one partition, as measured by MetaPhyler analysis of marker genes. Circle size is determined by the number of AMPHORA genes in the assembly of each partition.
Figure 5
GC Content versus Contig Depth. Plotted are the GC content (x-axis) and depth (y-axis) for contigs in partitions representing the top 15 enriched families from the FijiCoMP collection. Alignments to different families are depicted in different colors, and the size of each circle represents the length of each contig. For each family the background abundance is indicated in parentheses.
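
The read-partitioning step described in the Figure 3 caption (assigning each read according to how its k-mers intersect the k-mer clusters) might be sketched as follows. This is a minimal illustration under stated assumptions, not the published method: `assign_read`, the majority-vote rule, the CRC32 hash, and the hand-built `column_cluster` mapping (hash column to cluster id, assumed to come from an upstream clustering step) are all hypothetical.

```python
import zlib
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def column_of(kmer, n_cols):
    """Map a k-mer to a hash column (CRC32 is an illustrative choice)."""
    return zlib.crc32(kmer.encode()) % n_cols


def assign_read(read, column_cluster, k, n_cols):
    """Return the cluster holding most of the read's k-mers, or None if none match."""
    votes = Counter()
    for kmer in kmers(read, k):
        cluster = column_cluster.get(column_of(kmer, n_cols))
        if cluster is not None:
            votes[cluster] += 1
    if not votes:
        return None
    # Ties go to the smallest cluster id so the result is deterministic.
    return max(sorted(votes), key=votes.__getitem__)


# Hypothetical cluster map: k-mers of one toy genome in cluster 1, the other in cluster 0.
K, N_COLS = 4, 2**20
column_cluster = {}
for kmer in kmers("TTGGCCAA", K):
    column_cluster[column_of(kmer, N_COLS)] = 1
for kmer in kmers("ACGTACGT", K):
    column_cluster[column_of(kmer, N_COLS)] = 0

partitions = {}
for read in ["ACGTACGT", "TTGGCCAA", "ACGTACGT"]:
    partitions.setdefault(assign_read(read, column_cluster, K, N_COLS), []).append(read)
print(partitions)
```

Because each read is assigned using only its own k-mers and a fixed lookup table, reads can be processed one at a time, and each resulting partition can then be assembled on its own.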

Comment in

  • Strain recovery from metagenomes.
    Brown CT. Nat Biotechnol. 2015 Oct;33(10):1041-3. doi: 10.1038/nbt.3375. PMID: 26448087. No abstract available.
