Genome Res. 2011 Dec;21(12):2224-41.
doi: 10.1101/gr.126599.111. Epub 2011 Sep 16.

Assemblathon 1: A Competitive Assessment of De Novo Short Read Assembly Methods

Free PMC article

Dent Earl et al. Genome Res. 2011.

Abstract

Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/.

Figures

Figure 1.
The phylogeny of the simulated haploid genomes. The root genome derives from human chromosome 13. The α1 and α2 haplotypes form the diploid genome from which we generated reads. The β1 and β2 haplotypes form a diploid out-group genome that was made available to the assemblers.
Figure 2.
N50 statistics. Assemblies are sorted left to right in descending order by scaffold path NG50. Data points for each assembly are slightly offset along the x-axis in order to show overlaps.
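For readers unfamiliar with the statistics plotted here, the following is a minimal sketch of how N50 and NG50 are commonly computed from a list of sequence lengths. The function names and the example lengths are illustrative assumptions, not taken from the Assemblathon evaluation code; the sketch also assumes NG50 uses a supplied genome length where N50 uses the total assembly length.

```python
def ng50(lengths, genome_size):
    """Largest length L such that sequences of length >= L together
    cover at least half of genome_size. Returns 0 if the sequences
    never reach that target."""
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def n50(lengths):
    """N50 is the same calculation, normalized by the assembly's own total length."""
    return ng50(lengths, sum(lengths))


# Hypothetical scaffold lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))        # half of 290 is reached at the 70-long sequence
print(ng50(example_lengths, 400))  # half of 400 is reached at the 50-long sequence
```

Normalizing by a fixed genome length makes values comparable across assemblies of different total size, which is presumably why the figure sorts by NG50 rather than N50.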
Figure 3.
An adjacency graph example demonstrating threads, contig paths, and scaffold paths. Each stack of boxes represents a block edge. The nodes of the graph are represented by the left and right ends of the stacked boxes. The adjacency edges are groups of lines that connect the ends of the stacked boxes. Threads are represented (inset) within the graph as alternating connected boxes and colored lines. There are three threads shown: (top to bottom) black, gray, and light gray. The black and gray threads represent two haplotypes; there are many alternative haplotype threads that result from a mixture of these haplotype segments, which are equally plausible given no additional information to deconvolve them. The light-gray thread represents an assembly sequence. For the assembly thread, consistent adjacencies are shown in solid light gray. The dashed light-gray line between the right end of block g and the left end of block i represents a structural error (deletion). The dashed light-gray line between the right end of block k and the left end of block m represents a scaffold gap, because the segment of the assembly in block l contains wild-card characters. The example, therefore, contains three contig paths: (from left to right) blocks a…g ACTGAAATCGGGACCCC; blocks i, j, k GGAAC; and block m CC. However, the example contains only two scaffold paths because the latter two contig paths are concatenated to form one scaffold path.
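As a rough illustration of the scaffold-gap case only, the sketch below splits a scaffold sequence into contigs at runs of wild-card characters. It assumes the wild-card character is N and that the gap is three characters long; both are assumptions made for the example. It does not model the adjacency graph, and it cannot detect structural errors such as the deletion in the caption, which require alignment to the haplotypes.

```python
import re


def split_scaffold(sequence):
    """Split a scaffold into contigs at runs of N (assumed wild-card character).
    Empty pieces from leading or trailing N runs are dropped."""
    return [piece for piece in re.split(r"[Nn]+", sequence) if piece]


# GGAAC and CC are the last two contig paths in the caption; the NNN gap
# between them is a hypothetical stand-in for the wild-card segment.
scaffold = "GGAAC" + "NNN" + "CC"
print(split_scaffold(scaffold))  # ['GGAAC', 'CC']
```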
Figure 4.
Assembly coverage along haplotype α1 stratified by scaffold path length weighted overall coverage. The top six rows show density plots of annotations. (CDS) Coding sequence; (UTR) untranslated region; (NXE) nonexonic conserved regions within genes; (NGE) nongenic conserved regions; (island) CpG islands; (repeats) repetitive elements. The remaining rows show the top-ranked assembly from each group, sorted by scaffold path length weighted overall coverage. Each such row is a density plot of the coverage, with colored stack fills used to show the length of scaffold paths mapped to a given location in the haplotype. For example, the left-most light-orange block of the WTSI-S assembly row represents a region of haplotype α1 that is almost completely covered by a scaffold path from the WTSI-S assembly greater than one megabase in length.
Figure 5.
The proportion of correctly contiguous pairs as a function of their separation distance. Each line represents the top assembly from each team. Correctly contiguous 50 (CC50) values are the lowest point of each line. The legend is ordered top to bottom in descending order of CC50. Proportions were calculated by taking 100,000,000 random samples and binning them into 2000 bins, equally spaced along a log10 scale, so that an approximately equal number of samples fell in each bin.
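The binning described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, each sample is assumed to arrive as a (separation distance, correctly contiguous?) pair, and how a pair is judged correctly contiguous is left outside the sketch.

```python
import math


def log_bin_index(distance, max_distance, n_bins):
    """Map a distance in [1, max_distance] to one of n_bins bins that are
    equally spaced on a log10 scale. Requires max_distance > 1."""
    frac = math.log10(distance) / math.log10(max_distance)
    return min(int(frac * n_bins), n_bins - 1)  # clamp distance == max_distance


def binned_proportions(samples, max_distance, n_bins):
    """samples: iterable of (distance, is_correct) pairs.
    Returns the proportion of correct pairs per bin (None for empty bins)."""
    correct = [0] * n_bins
    total = [0] * n_bins
    for distance, is_correct in samples:
        i = log_bin_index(distance, max_distance, n_bins)
        total[i] += 1
        if is_correct:
            correct[i] += 1
    return [c / t if t else None for c, t in zip(correct, total)]


# Tiny hypothetical input; the study used 100,000,000 samples and 2000 bins.
samples = [(5, True), (5, False), (50, True), (500, False), (1000, True)]
print(binned_proportions(samples, max_distance=1000, n_bins=3))  # [0.5, 1.0, 0.5]
```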
Figure 6.
Substitution (base) errors for the top assembly from each team. (Top) Substitution errors per correct bit within all valid columns; (middle) substitution errors per correct bit within homozygous columns only; (bottom) substitution errors per correct bit within heterozygous columns only. Assemblies are sorted from left to right in ascending order by the sum of substitutions per correct bit. In each faceted plot, each assembly is shown as an interval, giving the upper and lower bounds on the numbers of substitution errors (see main text).
Figure 7.
Copy-number errors for the top assembly from each team. (Top) Proportion of haplotype containing columns with a copy-number error; (middle) proportion of haplotype containing columns with an excess copy-number error; (bottom) proportion of haplotype containing columns with a deficiency copy-number error. Assemblies are sorted from left to right in ascending order according to the proportion of haplotype containing columns with a copy-number error. In each faceted plot, each assembly is shown as an interval, giving the upper and lower bounds on the numbers of copy-number errors (see main text).
Figure 8.
Scaffold gap and error subgraphs. Diagrams follow the format of Figure 3. The rounded boxes represent extensions to the surrounding threads. Line ends not incident with the edge of boxes represent the continuation of a thread unseen. In each diagram the right end of block a and the left end of block b (if present) represent the ends of contig paths, and the enclosed gray thread represents the joining thread. The black thread represents a haplotype thread. The gray thread represents either a haplotype or bacterial contamination thread. (A) (Hanging) scaffold gaps and hanging insert errors. (B) Scaffold gaps and indel errors. (C) Intra- and interchromosomal joining errors and haplotype to contamination joining errors.
