Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Sergey Koren; Brian P Walenz; Konstantin Berlin; Jason R Miller; Nicholas H Bergman; Adam M Phillippy

doi:10.1101/gr.215087.116

Genome Res. 2017 May;27(5):722-736. doi: 10.1101/gr.215087.116. Epub 2017 Mar 15.

Authors

Sergey Koren¹, Brian P Walenz¹, Konstantin Berlin², Jason R Miller³, Nicholas H Bergman⁴, Adam M Phillippy¹

Affiliations

¹ Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
² Invincea Incorporated, Fairfax, Virginia 22030, USA.
³ J. Craig Venter Institute, Rockville, Maryland 20850, USA.
⁴ National Biodefense Analysis and Countermeasures Center, Frederick, Maryland 21702, USA.

Abstract

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
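
The contig NG50 quoted above is a continuity metric: the length of the shortest contig in the smallest set of longest contigs that together span at least half of the estimated genome size. The sketch below is only an illustration of that standard definition, not code from Canu; the function name `ng50` and the toy contig lengths are hypothetical.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or None if it is undefined.

    contig_lengths: iterable of contig lengths in bases.
    genome_size: estimated genome size in bases (an input assumption,
    not something derived from the assembly itself).
    """
    target = genome_size / 2
    running_total = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    # The contigs together cover less than half of the genome.
    return None


# Toy example with made-up lengths (not data from the paper).
toy_contigs = [50, 40, 30, 20, 10]
toy_ng50 = ng50(toy_contigs, genome_size=200)
toy_n50 = ng50(toy_contigs, genome_size=sum(toy_contigs))
```

Passing the total assembly length as `genome_size` gives the ordinary N50. NG50 uses the estimated genome size as the denominator instead, which keeps values comparable across assemblies of the same genome whose total lengths differ.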

Publication types

Research Support, N.I.H., Intramural
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Contig Mapping / methods*
Contig Mapping / standards
Drosophila melanogaster / genetics
Genome, Bacterial
Genomics / methods*
Genomics / standards
Humans
Repetitive Sequences, Nucleic Acid
Sequence Analysis, DNA / methods*
Sequence Analysis, DNA / standards
Software*