Skip to main page content
Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
, 7 (9), e42304

An Integrated Pipeline for De Novo Assembly of Microbial Genomes

Affiliations

An Integrated Pipeline for De Novo Assembly of Microbial Genomes

Andrew Tritt et al. PLoS One.

Abstract

Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, by integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics to mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. Sequence length accumulation curve for six assemblies of the model archaeon H. volcanii DS2.
Curves represent the number of bases in an assembly as a function of the formula image largest sequences. Assemblies generated from SOAPdenovo and A5 are labelled with “SOAP” and “A5”, respectively. “scaf” indicates an assembly that has been scaffolded, while “ctg” indicates no scaffolding. For A5, assembly “scaf-QC” has been broken using the A5QC algorithm and rescaffolded using SSPACE. A perfect assembly would have exactly the number of sequences as the organism has replicons (5 in this case), and the curve would be in the extreme upper left corner.
Figure 2
Figure 2. Overview of the stages in A5.
The first stage of the A5 pipeline cleans reads, removing any contaminant reads and correcting base-call errors. Then the pipeline assembles contigs with IDBA using these error corrected reads. These contigs are then scaffolded using the original read set. Scaffolds are then checked for misassemblies, and broken at regions containing misassemblies. Finally, the broken scaffolds are rescaffolded using the original read set.
Figure 3
Figure 3. Demonstration of the automated misassembly quality control process.
Upper Left: A hypothetical whole genome alignment of an assembly containing misassemblies relative to the true genome, consisting of three circular chromosomes, and the resulting broken assembly. Red, green, and blue lines connect aligned regions. Black connecting lines represent real paired read connections between contigs and orange connecting lines represent erroneous connections. Black boxes in the broken assembly highlight blocks identified by the DBSCAN algorithm. Upper Right and Bottom Row: Plots of connected points between contigs. Black and orange dots correspond to black and orange connections lines from left figure, respectively. Dotted lines correspond to misassembly breakpoints. Gray circles highlight the set of points that are clustered by DBSCAN.

Similar articles

See all similar articles

Cited by 239 articles

See all "Cited by" articles

References

    1. Kao WC, Stevens K, Song YS (2009) Bayescall: A model-based basecalling algorithm for high-throughput short-read sequencing. Genome Research 19: 1884–1895. - PMC - PubMed
    1. Kircher M, Stenzel U, Kelso J (2009) Improved base calling for the illumina genome analyzer using machine learning strategies. Genome Biology 10: R83. - PMC - PubMed
    1. Kelley D, Schatz M, Salzberg S (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11: R116. - PMC - PubMed
    1. Kao WC, Chan AH, Song YS (2011) Echo: A reference-free short-read error correction algorithm. Genome Research 21: 1181–1192. - PMC - PubMed
    1. Lassmann T, Hayashizaki Y, Daub CO (2009) Tagdusta program to eliminate artifacts from next generation sequencing data. Bioinformatics 25: 2839–2840. - PMC - PubMed

Publication types

Grant support

This work was supported by National Science Foundation award ER 0949453. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

LinkOut - more resources

Feedback