PLoS One. 2012;7(2):e31386.
doi: 10.1371/journal.pone.0031386. Epub 2012 Feb 23.

Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data

Free PMC article

Daniel R Mende et al. PLoS One. 2012.

Erratum in

  • PLoS One. 2014;9(11):e114063

Abstract

Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Comparison of Assemblies of Illumina Data with and without Quality Control.
Contig length histograms show the number of contigs in each size fraction for assemblies of Illumina reads with quality filtering (red) and without quality filtering (purple). Contig lengths were compared for assemblies of different community complexities (10 genomes, 100 genomes, 400 genomes). Only contigs longer than 500 bp are shown; the x-axis is on a log scale. Pre-assembly quality control of the reads strongly improved the assemblies.
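The binning behind a log-scale contig length histogram can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function name `contig_length_histogram` and the `bins_per_decade` parameter are hypothetical, and only the 500 bp cutoff comes from the caption.

```python
import math
from collections import Counter

def contig_length_histogram(lengths, min_length=500, bins_per_decade=4):
    """Count contigs per log10-spaced length bin.

    Contigs of min_length bp or shorter are skipped, mirroring the
    caption's "only contigs longer than 500 bp" rule. The bin width
    (bins_per_decade) is an assumption made for illustration.
    """
    counts = Counter()
    for length in lengths:
        if length <= min_length:
            continue
        # Bin index on a log10 axis: equal-width bins in log space.
        bin_index = math.floor(math.log10(length) * bins_per_decade)
        counts[bin_index] += 1
    # Key each bin by its lower edge in bp, rounded to the nearest integer.
    return {round(10 ** (i / bins_per_decade)): n
            for i, n in sorted(counts.items())}

hist = contig_length_histogram([400, 500, 600, 1200, 1500, 20000])
```

Here the 400 bp and 500 bp contigs are dropped, and the 1200 bp and 1500 bp contigs fall into the same bin, whose lower edge is 1000 bp.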
Figure 2. Comparison of assemblies from different sequencing technologies.
a) Contig Length Distribution. Histograms of the contig lengths show the number of contigs in each size fraction for assemblies of Illumina reads with quality filtering (red), Sanger sequenced reads (yellow) and reads from pyrosequencing (blue). Only contigs longer than 500 bp are shown; the x-axis is on a log scale. Assemblies were generated for different community complexities (10 genomes, 100 genomes, 400 genomes). b) Overall Accuracy of the Contigs. The overall accuracy of the contigs is summarized using different measures of chimericity. Bars to the left show the percentage of all contigs that are chimeric, bars in the middle show the percentage of all contigs that have a Contig Score less than 95%, and bars to the right show the percentage that have a Contig Score less than 99%. Contig Score represents the percent identity between the contig and the reference genome from which it is derived. Contigs from Illumina reads are red, contigs from Sanger reads are yellow and contigs from pyrosequencing are blue. In general, a slightly higher proportion of Illumina contigs were chimeric, but they had higher contig scores. c) Contig Accuracy across Contig Lengths. These combined strip plots show the degree of chimericity (upper plot) and contig score (lower plot) for each contig in the assemblies; each dot represents one contig. Contigs are grouped into size bins. The degree of chimericity is the proportion of reads in a contig that are derived from the ‘wrong’ genome and thus make the contig chimeric. Contig Score represents the percent identity between the contig and the reference genome from which it is derived. Again, contigs from Illumina assemblies are in red, contigs from Sanger assemblies are in yellow and contigs from pyrosequencing assemblies are in blue. For all sequencing technologies and communities, longer contigs are more accurate.
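The degree of chimericity defined in panel c can be computed directly when the source genome of every simulated read is known. The sketch below is illustrative only: it assumes that the 'right' genome of a contig is the one contributing the most reads, which the caption does not state, and the function name `degree_of_chimericity` is hypothetical.

```python
from collections import Counter

def degree_of_chimericity(read_origins):
    """Fraction of reads in one contig that come from a 'wrong' genome.

    read_origins lists the source genome of each read in the contig.
    Assumption: the dominant (most frequent) genome is treated as the
    correct one, and every other read counts as chimeric.
    """
    if not read_origins:
        raise ValueError("contig has no reads")
    counts = Counter(read_origins)
    _, dominant_count = counts.most_common(1)[0]
    return (len(read_origins) - dominant_count) / len(read_origins)

# Nine reads from genome A and one from genome B.
score = degree_of_chimericity(["A"] * 9 + ["B"])
```

Under this definition a contig built from a single genome scores 0, and any value above 0 marks the contig as chimeric.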
Figure 3. Comparison of assemblies of Illumina contigs and Illumina scaftigs.
Scaffolds are constructed by linking contigs using information from paired-end reads; during this process, a number of unknown bases are usually placed between the sequences of the linked contigs. To use the information obtained by scaffolding, scaftigs can be constructed by extracting the contiguous sequences that lack unknown bases (Ns). a) Contig Length Distribution. Histograms of the contig lengths show the number of contigs in each size fraction for assemblies of Illumina contigs (red) and Illumina scaftigs (light blue). Only contigs longer than 500 bp are shown; the x-axis is on a log scale. Assemblies were generated for different community complexities (10 genomes, 100 genomes, 400 genomes). b) Overall Accuracy of the Contigs. The overall accuracy of the contigs is summarized using different measures of chimericity. Bars to the left show the percentage of all contigs that are chimeric, bars in the middle show the percentage of all contigs that have a Contig Score less than 95%, and bars to the right show the percentage that have a Contig Score less than 99%. Contig Score represents the percent identity between the contig and the reference genome from which it is derived. Illumina contigs are in red and Illumina scaftigs are in blue. c) Contig Accuracy across Contig Lengths. These combined strip plots show the degree of chimericity (upper plot) and contig score (lower plot) for each contig in the assemblies; each dot represents one contig. Contigs are grouped into size bins. The degree of chimericity is the proportion of reads in a contig that are derived from the ‘wrong’ genome and thus make the contig chimeric. Contig Score represents the percent identity between the contig and the reference genome from which it is derived. Again, Illumina contigs are in red and Illumina scaftigs are in blue.
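Scaftig extraction as described in this caption amounts to splitting each scaffold at its runs of unknown bases. The following is a minimal sketch under that reading; the function name `scaffold_to_scaftigs` and the `min_length` parameter are hypothetical and are not taken from the authors' software.

```python
import re

def scaffold_to_scaftigs(scaffold, min_length=1):
    """Split a scaffold sequence into scaftigs at runs of unknown bases.

    Any run of one or more N (or n) characters is treated as a gap.
    Pieces shorter than min_length, including the empty strings produced
    by leading or trailing gaps, are discarded.
    """
    pieces = re.split(r"[Nn]+", scaffold)
    return [piece for piece in pieces if len(piece) >= min_length]

scaftigs = scaffold_to_scaftigs("ACGTNNNNGGCCNTT")
# scaftigs == ["ACGT", "GGCC", "TT"]
```

Raising `min_length` (for example to 500) would apply a length cutoff like the one used for the histograms at the same step.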
Figure 4. Comparison of the functional repertoire of the metagenomes to each other and to the expected.
a) Correlations between expected and actual COG abundance. Dotplots compare the expected and actual abundance for each COG, with the x-axis displaying the COG abundances as expected from the input genomes and the y-axis displaying the COG abundances as determined from assembly and annotation of the simulated metagenomes. The black line shows the 1:1 correlation. The Pearson correlation coefficients are displayed for each dataset. b) Principal Coordinate Analysis (PCoA) and Principal Component Analysis (PCA). The COG abundance profiles were compared to each other using Jensen-Shannon divergence, and the resulting distance matrix was analyzed and plotted using PCoA. The COG abundance profiles were also analyzed and plotted using PCA. The dots are colored by sequencing method: Illumina contigs (red), Illumina scaftigs (light blue), Sanger (yellow) and pyrosequencing (blue).
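The Jensen-Shannon divergence used to compare COG abundance profiles can be written out as a short sketch. This is the standard definition with base-2 logarithms, which bounds the value between 0 and 1; whether the authors used this base or the square root of the divergence as the distance is not stated here, and the function name and the example COG counts are hypothetical.

```python
import math

def jensen_shannon_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence between two COG abundance profiles.

    Each argument maps a COG identifier to a raw abundance. Profiles are
    normalized to relative abundances first; COGs missing from one
    profile count as zero. Base-2 logs are an assumption of this sketch.
    """
    cogs = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    p = [p_counts.get(c, 0) / p_total for c in cogs]
    q = [q_counts.get(c, 0) / q_total for c in cogs]
    # Mixture distribution halfway between the two profiles.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

expected = {"COG0001": 10, "COG0002": 30}
observed = {"COG0001": 12, "COG0002": 28}
divergence = jensen_shannon_divergence(expected, observed)
```

Computing this value for every pair of profiles gives the symmetric distance matrix that a PCoA takes as input.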
