DNACLUST: accurate and efficient clustering of phylogenetic marker genes
- PMID: 21718538
- PMCID: PMC3213679
- DOI: 10.1186/1471-2105-12-271
Abstract
Background: Clustering is a fundamental operation in the analysis of biological sequence data. New DNA sequencing technologies have dramatically increased the rate at which we can generate data, resulting in datasets that cannot be efficiently analyzed by traditional clustering methods. This is particularly true in the context of taxonomic profiling of microbial communities through direct sequencing of phylogenetic markers (e.g., 16S rRNA), the domain that motivated the work described in this paper. Many analysis approaches rely on an initial clustering step aimed at identifying sequences that belong to the same operational taxonomic unit (OTU). When defining OTUs (which have no universally accepted definition), scientists must balance a trade-off between computational efficiency and biological accuracy, as accurately estimating an environment's phylogenetic composition requires computationally intensive analyses. We propose that efficient and mathematically well-defined clustering methods can benefit existing taxonomic profiling approaches in two ways: (i) the resulting clusters can be substituted for OTUs in certain applications; and (ii) the clustering effectively reduces the size of the datasets that need to be analyzed by complex phylogenetic pipelines (e.g., only one sequence per cluster needs to be provided to downstream analyses).
Results: To address the challenges outlined above, we developed DNACLUST, a fast clustering tool specifically designed for clustering highly similar DNA sequences. Given a set of sequences and a sequence similarity threshold, DNACLUST creates clusters whose radius is guaranteed not to exceed the specified threshold. Underlying DNACLUST is a greedy clustering strategy that owes its performance to novel sequence alignment and k-mer based filtering algorithms. DNACLUST can also produce multiple sequence alignments for every cluster, allowing users to manually inspect clustering results and enabling more detailed analyses of the clustered data.
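To make the idea of greedy, radius-bounded clustering with a k-mer filter concrete, the following is a minimal Python sketch. It is not DNACLUST's implementation: the function names (`kmer_counts`, `edit_distance`, `greedy_cluster`), the longest-first ordering, the use of plain edit distance, and the conversion of the similarity threshold into a maximum distance relative to the center's length are all assumptions made for illustration. The filter relies on the q-gram lemma: two sequences within edit distance d must share at least max(|s|, |t|) - k + 1 - k·d k-mers, so pairs sharing fewer can be skipped without computing an alignment.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def edit_distance(s, t):
    """Plain Levenshtein distance (two-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # match / substitution
        prev = cur
    return prev[-1]


def greedy_cluster(seqs, similarity, k=4):
    """Greedy clustering: every member lies within the radius of its center.

    Returns a list of clusters; each cluster is a list of indices into seqs,
    with the cluster center first.
    """
    # Assumption: process longer sequences first so they become centers.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    profiles = [kmer_counts(s, k) for s in seqs]
    assigned = [False] * len(seqs)
    clusters = []

    for pos, center in enumerate(order):
        if assigned[center]:
            continue
        assigned[center] = True
        # Assumption: radius = allowed edits relative to the center length.
        # The small epsilon guards against floating-point round-off.
        max_dist = math.floor((1 - similarity) * len(seqs[center]) + 1e-9)
        cluster = [center]
        for cand in order[pos + 1:]:
            if assigned[cand]:
                continue
            # k-mer filter (q-gram lemma): skip pairs that cannot be close.
            longest = max(len(seqs[center]), len(seqs[cand]))
            min_shared = longest - k + 1 - k * max_dist
            shared = sum((profiles[center] & profiles[cand]).values())
            if shared < min_shared:
                continue
            if edit_distance(seqs[center], seqs[cand]) <= max_dist:
                assigned[cand] = True
                cluster.append(cand)
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC",
             "ACGTACGTAC", "TTTTGGGGCA"]
    print(greedy_cluster(reads, similarity=0.9))
```

Because each sequence is compared only against the current center, every cluster member is within the chosen radius of that center by construction, and passing only the first index of each cluster to a downstream pipeline gives one representative per cluster.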
Conclusions: We compare DNACLUST to two popular clustering tools: CD-HIT and UCLUST. We show that DNACLUST is about an order of magnitude faster than CD-HIT and UCLUST (exact mode) and comparable in speed to UCLUST (approximate mode). The performance of DNACLUST improves as the similarity threshold is increased (tight clusters), making it well suited for rapidly removing duplicates and near-duplicates from a dataset, thereby reducing the size of the data being analyzed through more elaborate approaches.
