Estimating the composition of species in metagenomes by clustering of next-generation read sequences

Methods. 2014 Oct 1;69(3):213-9. doi: 10.1016/j.ymeth.2014.07.009. Epub 2014 Jul 27.


Faster and cheaper sequencing technologies, together with the ability to sequence uncultured microbes collected from any environment, present us with an opportunity to distill meaningful information from the millions of new genomic sequences obtained from environmental samples, called metagenomes. In contrast to data from conventionally cultured microbes, however, metagenomic data are extremely heterogeneous and noisy. Separating the sets of sequenced genomic fragments that belong to different microbes is therefore essential for the successful assembly of microbial genomes. In this paper, we present a novel clustering method for metagenomic datasets. Such datasets have two distinguishing features: (i) similar sequence patterns may exist in different species, and (ii) each species is represented by a different number of individuals in the dataset. Our method overcomes these obstacles by using a Gaussian mixture model and an analysis of mixture profiles, taking advantage of genomic signatures extracted from the metagenomic dataset. Unlike conventional clustering methods, in which clusters are discovered through global similarities among data instances, our method builds clusters by combining data instances that share local similarities captured by the mixture analysis. By considering shared mixture components, our method is able to create clusters of genomic sequences even when they are globally distinct from each other. We applied our method to an artificial metagenomic dataset comprising 47 million simulated reads from 25 real microbial genomes, and analyzed the resulting clusters in terms of the number of clusters, the number of participating species, and the dominant species in each cluster. Although our approach cannot address all challenges in metagenome sequence clustering, we believe that our method is a step toward meeting them.
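
The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names, not the authors' method. It assumes that a genomic signature is a normalized k-mer frequency vector and that a mixture profile is a read's vector of posterior weights over the components of an already fitted Gaussian mixture (for instance, as returned by scikit-learn's `GaussianMixture.predict_proba`). The function names, the `threshold` parameter, and the grouping rule (link two reads whenever both give at least `threshold` weight to the same component) are hypothetical choices made for illustration. Fitting the mixture itself is omitted.

```python
from itertools import product


def kmer_signature(seq, k=2):
    """Return normalized k-mer frequencies of seq, ordered lexicographically over ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other ambiguity codes
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def group_by_shared_components(profiles, threshold=0.3):
    """Group reads whose mixture profiles share a sufficiently weighted component.

    profiles[i][j] is the (assumed) posterior weight of component j for read i.
    The threshold is a hypothetical parameter, not taken from the paper.
    """
    parent = list(range(len(profiles)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_read_for = {}  # component index -> first read that used it
    for i, profile in enumerate(profiles):
        for j, weight in enumerate(profile):
            if weight >= threshold:
                if j in first_read_for:
                    parent[find(i)] = find(first_read_for[j])
                else:
                    first_read_for[j] = i

    groups = {}
    for i in range(len(profiles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Reads 0 and 1 have different strongest components but share component 1,
# so they land in the same group; read 2 shares nothing and stays alone.
example_profiles = [
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
example_groups = group_by_shared_components(example_profiles)
```

In this sketch, linking through any shared component is what lets two reads with different overall profiles end up together, which mirrors the abstract's contrast between local and global similarity. A real pipeline would need a principled choice of k, of the number of mixture components, and of the linking rule.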

Keywords: Clustering; Gaussian mixture; Metagenome; Species estimation.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Base Sequence
  • Cluster Analysis
  • Computational Biology / methods*
  • Genomics
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Metagenome*
  • Sequence Analysis, DNA / methods*