A de Bruijn graph approach to the quantification of closely-related genomes in a microbial community

J Comput Biol. 2012 Jun;19(6):814-25. doi: 10.1089/cmb.2012.0058.


The wide application of next-generation sequencing (NGS) technologies in metagenomics has raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community. This can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes, and then profiling the abundance of each species from the number of reads mapped to it. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely related microbial species (e.g., within the same genus) or strains (e.g., within the same species). When a read maps to a region shared by multiple genomes at high similarity, it is therefore often difficult to determine which species or strain the read was sampled from. Furthermore, individual genomes within the same species show high genomic variation, which is difficult to distinguish from inter-species variation during read mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to estimating the abundance of closely related species within the same genus: the reads are mapped to a de Bruijn graph that represents the genomes of these species and in which the genomic regions common to them are collapsed.
Using simulated and real metagenomic datasets, we show that the de Bruijn graph approach has several advantages over existing methods: (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification of closely related species (and even of strains within the same species).
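
The abstract does not describe the implementation, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes exact k-mer matching, a single strand (no reverse complements), and error-free reads. All names (`build_colored_kmer_index`, `assign_read`, `quantify`) and the toy sequences are hypothetical. Each distinct k-mer is stored once and labelled with the genomes that contain it, so regions shared between genomes are collapsed. Sharing ambiguous reads in proportion to uniquely assigned reads is an illustrative heuristic chosen for this sketch.

```python
from collections import defaultdict


def build_colored_kmer_index(genomes, k):
    """Map each distinct k-mer to the set of genomes that contain it.

    A k-mer shared by several genomes is stored once (collapsed) and
    carries the names of all genomes it occurs in.
    """
    index = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def assign_read(read, index, k):
    """Return the genomes consistent with every k-mer of the read.

    An empty set means the read is treated as unmapped (too short, or
    it contains a k-mer that is absent from the index).
    """
    candidates = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            return set()
        candidates = set(colors) if candidates is None else candidates & colors
    return candidates or set()


def quantify(reads, index, k):
    """Estimate per-genome read counts.

    Reads compatible with exactly one genome count fully for it. Reads
    compatible with several genomes are split in proportion to the
    unique counts of those genomes (assumed heuristic), or evenly if
    none of them has a uniquely assigned read.
    """
    unique = defaultdict(float)
    ambiguous = []
    for read in reads:
        candidates = assign_read(read, index, k)
        if len(candidates) == 1:
            unique[next(iter(candidates))] += 1.0
        elif len(candidates) > 1:
            ambiguous.append(candidates)

    counts = dict(unique)
    for candidates in ambiguous:
        total = sum(unique.get(g, 0.0) for g in candidates)
        for g in candidates:
            share = unique.get(g, 0.0) / total if total else 1.0 / len(candidates)
            counts[g] = counts.get(g, 0.0) + share
    return counts


# Toy example: two hypothetical genomes that differ at one position.
genomes = {"g1": "AAACCCGGGTTT", "g2": "AAACCCGTGTTT"}
reads = ["AAACCC", "CCGGGT", "CCCGGG", "CGTGTT"]
index = build_colored_kmer_index(genomes, k=4)
estimates = quantify(reads, index, k=4)
print(len(index), estimates)
```

In this toy example the two genomes contribute 18 k-mer occurrences in total but only 13 distinct k-mers, so the shared regions are represented once. The read `AAACCC` falls entirely within a shared region and is counted once, split between the two genomes, rather than being mapped to each genome separately.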

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Chromosome Mapping / methods*
  • Escherichia coli / classification
  • Escherichia coli / genetics*
  • Genome, Bacterial*
  • Genomic Structural Variation
  • Metagenomics
  • Microbial Consortia
  • Phylogeny
  • Sequence Alignment
  • Sequence Analysis, DNA
  • Treponema / classification
  • Treponema / genetics*