Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Jun 30:12:271.
doi: 10.1186/1471-2105-12-271.

DNACLUST: accurate and efficient clustering of phylogenetic marker genes

Affiliations

DNACLUST: accurate and efficient clustering of phylogenetic marker genes

Mohammadreza Ghodsi et al. BMC Bioinformatics. .

Abstract

Background: Clustering is a fundamental operation in the analysis of biological sequence data. New DNA sequencing technologies have dramatically increased the rate at which we can generate data, resulting in datasets that cannot be efficiently analyzed by traditional clustering methods.This is particularly true in the context of taxonomic profiling of microbial communities through direct sequencing of phylogenetic markers (e.g. 16S rRNA) - the domain that motivated the work described in this paper. Many analysis approaches rely on an initial clustering step aimed at identifying sequences that belong to the same operational taxonomic unit (OTU). When defining OTUs (which have no universally accepted definition), scientists must balance a trade-off between computational efficiency and biological accuracy, as accurately estimating an environment's phylogenetic composition requires computationally-intensive analyses. We propose that efficient and mathematically well defined clustering methods can benefit existing taxonomic profiling approaches in two ways: (i) the resulting clusters can be substituted for OTUs in certain applications; and (ii) the clustering effectively reduces the size of the data-sets that need to be analyzed by complex phylogenetic pipelines (e.g., only one sequence per cluster needs to be provided to downstream analyses).

Results: To address the challenges outlined above, we developed DNACLUST, a fast clustering tool specifically designed for clustering highly-similar DNA sequences.Given a set of sequences and a sequence similarity threshold, DNACLUST creates clusters whose radius is guaranteed not to exceed the specified threshold. Underlying DNACLUST is a greedy clustering strategy that owes its performance to novel sequence alignment and k-mer based filtering algorithms.DNACLUST can also produce multiple sequence alignments for every cluster, allowing users to manually inspect clustering results, and enabling more detailed analyses of the clustered data.

Conclusions: We compare DNACLUST to two popular clustering tools: CD-HIT and UCLUST. We show that DNACLUST is about an order of magnitude faster than CD-HIT and UCLUST (exact mode) and comparable in speed to UCLUST (approximate mode). The performance of DNACLUST improves as the similarity threshold is increased (tight clusters) making it well suited for rapidly removing duplicates and near-duplicates from a dataset, thereby reducing the size of the data being analyzed through more elaborate approaches.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Dynamic programming table example. Partially filled dynamic programming table. The query sequence is represented on the horizontal axis. At this point the algorithm has computed the alignment costs for a prefix of length 4 of the data sequences - which is shown on the vertical axis. Since we are calculating semi-global alignment the first row is initialized to all zeroes, i.e. The alignment of the shorter data sequence can start at any position of the longer query sequence without any penalty. In this figure, the distance threshold is 2, and any values larger than this threshold are set to the maximum value represented by ∞. To optimize the running time, since there are only three valid values on the last finished row, only the values for the three gray cells need to be computed on row right above it.
Figure 2
Figure 2
Running times. Plot of running time as a function of cluster radius for various tools and settings, on the twins dataset. The dataset contains 1.1 million pyrosequencing reads from the V2 region of the 16S rRNA gene. The reads have an average length of 231 base pairs. The running times were measured on a single 1.8 GHz processor of an Intel x86-64 Linux laptop with 4 GB RAM. The command line options were: dnaclust infile.fasta -s 0.9x -k 3 [--approximate-filter] > outfile.cluster uclust --input infile-sorted.fasta --uc outfile.cluster --id 0.9x [--exact].
Figure 3
Figure 3
Distribution of cluster MSAs based on their average pairwise distance. Figures 3a, 3b and 3c show the distribution of sampled cluster multiple sequence alignments based on their average pairwise distance for thresholds 99%, 97% and 95%, respectively. The figures show that DNACLUST cluster MSAs (thick blue line) are tighter (i.e. have smaller average pairwise distance) than UCLUST cluster MSAs (thick red lines). Furthermore computing a "traditional" MSA using ClustalW from the clusters produced by DNACLUST and UCLUST results in an overestimation of the distances between sequences (dashed lines).

Similar articles

Cited by

References

    1. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282. doi: 10.1093/bioinformatics/17.3.282. - DOI - PubMed
    1. Wang Q, Garrity G, Tiedje J, Cole J. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and environmental microbiology. 2007;73(16):5261. doi: 10.1128/AEM.00062-07. - DOI - PMC - PubMed
    1. Schloss P, Handelsman J. Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness. Applied and environmental microbiology. 2005;71(3):1501. doi: 10.1128/AEM.71.3.1501-1506.2005. - DOI - PMC - PubMed
    1. Schloss P, Westcott S, Ryabin T, Hall J, Hartmann M, Hollister E, Lesniewski R, Oakley B, Parks D, Robinson C. et al.Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and environmental microbiology. 2009;75(23):7537. doi: 10.1128/AEM.01541-09. - DOI - PMC - PubMed
    1. Felsenstein J. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle; 2005. PHYLIP (phylogeny inference package) version 3.6.

Publication types

Substances

LinkOut - more resources