Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
- PMID: 28298431
- PMCID: PMC5411767
- DOI: 10.1101/gr.215087.116
Abstract
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
© 2017 Koren et al.; Published by Cold Spring Harbor Laboratory Press.
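The contig NG50 quoted in the abstract can be illustrated with a short calculation. The sketch below uses the common definition of NG50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the genome size. The function name `ng50` and the example numbers are hypothetical and are not taken from Canu or from the paper.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Return the NG50 of an assembly, or None if it is undefined.

    Hypothetical helper for illustration only. NG50 is measured against
    the genome size, while N50 is measured against the total assembly
    length.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")

    target = genome_size / 2  # half of the genome must be covered
    covered = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    # The contigs together span less than half of the genome.
    return None


# Toy example with made-up contig lengths and a genome size of 200.
toy_contigs = [50, 40, 30, 20, 10]
toy_ng50 = ng50(toy_contigs, genome_size=200)
```

Passing the total assembly length as `genome_size` yields the ordinary N50, which is why NG50 is the more comparable statistic when assemblies of the same genome differ in total size.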