Algorithms Mol Biol. 2013 Sep 16;8(1):22.
doi: 10.1186/1748-7188-8-22.

Space-efficient and Exact De Bruijn Graph Representation Based on a Bloom Filter

Free PMC article


Rayan Chikhi et al. Algorithms Mol Biol. 2013.

Abstract

Background: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥30 GB).

Results: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives.

Conclusions: Minia, an assembler implementing this structure, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
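
The encoding described under Results can be illustrated with a minimal Python sketch. It is not Minia's implementation: the class and function names, the BLAKE2-based hash functions, and the default parameters are all assumptions made for illustration, and reverse complements are ignored for brevity. The idea shown is that a Bloom filter stores the k-mers, and a separate set records the critical false positives, i.e. k-mers adjacent to a true node that the filter wrongly accepts.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned for speed)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: one salted BLAKE2b digest per hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def neighbors(kmer):
    """Up to 8 candidate neighbors: 4 successors and 4 predecessors."""
    for c in "ACGT":
        yield kmer[1:] + c
        yield c + kmer[:-1]


class ExactDeBruijnGraph:
    """Bloom filter plus an explicit set of critical false positives."""

    def __init__(self, kmers, bits_per_kmer=8, n_hashes=4):
        kmers = set(kmers)
        self.bloom = BloomFilter(max(8, bits_per_kmer * len(kmers)), n_hashes)
        for km in kmers:
            self.bloom.add(km)
        # Critical false positives: neighbors of true nodes that the filter
        # accepts although they are not true nodes.
        self.cfp = {
            nb
            for km in kmers
            for nb in neighbors(km)
            if nb in self.bloom and nb not in kmers
        }

    def contains(self, kmer):
        # Exact only for true nodes and their immediate neighbors.
        return kmer in self.bloom and kmer not in self.cfp

    def true_neighbors(self, kmer):
        return {nb for nb in neighbors(kmer) if self.contains(nb)}
```

In this sketch, membership answers are exact only for k-mers reached by traversal from a true node; an arbitrary k-mer far from the graph can still be a false positive of the filter, which is why only the critical ones need to be stored.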

Figures

Figure 1
A complete example of removing false positives in the probabilistic de Bruijn graph. (a) shows S, an example de Bruijn graph (the 7 non-dashed nodes), and its probabilistic representation from a Bloom filter (taking the union of all nodes). Dashed rectangular nodes (in red in the electronic version) are immediate neighbors of S in the probabilistic representation. These nodes are the critical false positives. Dashed circular nodes (in green) are all the other nodes of the probabilistic representation; (b) shows a sample of the hash values associated with the nodes of S (a toy hash function is used); (c) shows the complete Bloom filter associated with S; incidentally, the nodes of the probabilistic representation are exactly those to which the Bloom filter answers positively; (d) describes the lower bound for exactly encoding the nodes of S (self-information) and the space required to encode our structure (Bloom filter, 10 bits, and 3 critical false positives, 6 bits per 3-mer).
Figure 2
Optimal data structure size for the parameter r, then for the parameter k. (a) Structure size (Bloom filter, critical false positives) as a function of the number of bits per k-mer allocated to the Bloom filter (also called ratio r) for k = 27. The trade-off that optimizes the total size is shown in dashed lines. (b) Optimal size of the structure for different values of k.
Figure 3
Data structure sizes for the probabilistic de Bruijn graph. Data structure sizes (Bloom filter, marking structure, and critical false positives (cFP) if applicable) for the probabilistic de Bruijn graph with (top right) and without (top left) the cFP structure, for an actual dataset (E. coli, k = 23). All plots are shown as a function of the number of bits per k-mer allocated to the Bloom filter. Additionally, the difference between a reference assembly made using an exact de Bruijn graph and an assembly made with each structure is shown (bottom left and bottom right).
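
The size trade-off described in the captions of Figures 1 and 2 can be explored numerically. The sketch below is a rough model under stated assumptions, not the authors' derivation: it assumes a Bloom filter with an optimal number of hash functions (false-positive rate of about 0.6185^r for r bits per k-mer), at most 8 candidate neighbors per node, and 2k bits per stored critical false positive (consistent with the 6 bits per 3-mer in Figure 1). It also assumes that "self-information" means log2 of the number of ways to choose the node set among all 4^k possible k-mers. All function names are hypothetical.

```python
import math

# Assumption: standard Bloom filter result for an optimal number of hash
# functions; the false-positive rate is roughly 0.6185 ** r.
FP_BASE = 0.6185


def structure_bits_per_kmer(r, k):
    """Modelled total size (bits per k-mer): Bloom filter plus cFP set."""
    false_positive_rate = FP_BASE ** r
    # Assumption: each node has at most 8 candidate neighbors to test.
    expected_cfp_per_kmer = 8 * false_positive_rate
    # Each critical false positive is stored explicitly in 2k bits.
    return r + expected_cfp_per_kmer * 2 * k


def best_ratio(k, lo=1.0, hi=40.0, step=0.01):
    """Grid search for the ratio r that minimizes the modelled size."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda r: structure_bits_per_kmer(r, k))


def toy_figure_sizes(k=3, n_nodes=7, bloom_bits=10, n_cfp=3):
    """Sizes for the toy example of Figure 1, in bits."""
    # Assumed reading of "self-information": log2 of C(4^k, n_nodes).
    lower_bound = math.log2(math.comb(4 ** k, n_nodes))
    structure = bloom_bits + n_cfp * 2 * k
    return lower_bound, structure


if __name__ == "__main__":
    for k in (23, 27, 31):
        r = best_ratio(k)
        print(k, round(r, 2), round(structure_bits_per_kmer(r, k), 2))
    print(toy_figure_sizes())
```

Under this model, a larger r shrinks the set of critical false positives but enlarges the Bloom filter, so the total has a single minimum, which is the trade-off Figure 2(a) marks with dashed lines.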


