TreeCluster: Clustering biological sequences using phylogenetic trees
- PMID: 31437182
- PMCID: PMC6705769
- DOI: 10.1371/journal.pone.0221068
Abstract
Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
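To make the first constraint concrete, the following is a minimal sketch of one way to cluster leaves under a diameter limit: a bottom-up greedy pass that cuts the deeper child edge whenever the two farthest leaves below a node would be more than the threshold apart. It is an illustration only, not TreeCluster's code. The function name max_diameter_clusters, the dictionary-based tree representation, and the example tree are all assumptions made for this sketch.

```python
from collections import defaultdict


def max_diameter_clusters(children, root, threshold):
    """Greedy bottom-up clustering under a diameter limit (illustrative sketch).

    children: dict mapping a node to a list of (child, branch_length) pairs;
              nodes missing from the dict are treated as leaves (assumed format).
    Returns a list of sets of leaf labels.
    """
    # Pre-order listing: every parent appears before its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(child for child, _ in children.get(node, []))

    depth = {}   # distance from a node to its farthest still-attached leaf
    cut = set()  # child nodes whose edge to the parent has been cut

    # Visit children before parents.
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            depth[node] = 0.0
            continue
        # Sorting is fine for a sketch; for binary trees each node does
        # constant work, so the pass is linear in the tree size.
        reach = sorted(((depth[c] + length, c) for c, length in kids),
                       key=lambda pair: pair[0])
        # While the two deepest children would violate the limit,
        # detach the deepest one.
        while len(reach) >= 2 and reach[-1][0] + reach[-2][0] > threshold:
            cut.add(reach.pop()[1])
        depth[node] = reach[-1][0]

    # Leaves still connected after removing the cut edges share a cluster.
    cluster_of = {root: root}
    clusters = defaultdict(set)
    for node in order:
        kids = children.get(node, [])
        if not kids:
            clusters[cluster_of[node]].add(node)
        for child, _ in kids:
            cluster_of[child] = child if child in cut else cluster_of[node]
    return list(clusters.values())


# Hypothetical tree: ((A:1,B:1)X:1,(C:1,D:1)Y:5)R
example_tree = {
    "R": [("X", 1.0), ("Y", 5.0)],
    "X": [("A", 1.0), ("B", 1.0)],
    "Y": [("C", 1.0), ("D", 1.0)],
}
example_clusters = max_diameter_clusters(example_tree, "R", 3.0)
```

With a threshold of 3.0 on this example tree, the long branch above Y is cut, so A and B end up in one cluster and C and D in another; each pair is 2.0 apart, within the limit.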
Conflict of interest statement
The authors have declared that no competing interests exist.
