Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer

Cell Syst. 2021 Oct 20;12(10):958-968.e6. doi: 10.1016/j.cels.2021.08.009. Epub 2021 Sep 14.

Abstract

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. Here, we define an algorithmic approach, mdBG, that makes use of minimizer-space de Bruijn graphs to enable long-read genome assembly. mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without compromising accuracy. A human genome is assembled in under 10 min using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 min using 1 GB RAM. In addition, we constructed a minimizer-space de Bruijn graph-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully search it for anti-microbial resistance (AMR) genes in 12 min. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics. Code for constructing mdBGs is freely available for download at https://github.com/ekimb/rust-mdbg/.

Keywords: bacterial genomes; data structures; de Bruijn graphs; genome assembly; genome graphs; long-read sequencing; metagenomics; minimizers; pangenomics; partial order alignment.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Genomics*
  • Humans
  • Metagenomics
  • Microcomputers
  • Sequence Analysis, DNA / methods