Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
- PMID: 16731699
- DOI: 10.1093/bioinformatics/btl158
Abstract
In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs that use the same algorithm: cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database; and cd-hit-est-2d compares two nucleotide datasets. All of these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on popular sequence comparison and database search tools such as BLAST.
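The abstract does not spell out the shared algorithm. As described in the earlier cd-hit papers, it is a greedy incremental clustering: sequences are sorted by length, the longest becomes the first cluster representative, and each remaining sequence either joins the first representative it is similar enough to or becomes a new representative. The following is a minimal Python sketch of that idea only, not the cd-hit implementation. The function names `identity` and `greedy_cluster`, the 0.9 threshold, and the toy sequences are assumptions made for illustration. The similarity measure uses `difflib.SequenceMatcher` as a crude stand-in for alignment-based identity, and the short-word filtering that gives cd-hit its speed is omitted.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Fraction of the shorter sequence covered by matching blocks.

    Assumption: a rough stand-in for alignment identity, not cd-hit's measure.
    Both sequences must be non-empty.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict, threshold: float = 0.9) -> dict:
    """Greedy incremental clustering: map each representative to its members."""
    clusters = {}
    # Process the longest sequences first; ties are broken by name.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # join the first matching cluster
                break
        else:
            clusters[name] = [name]  # no match: start a new cluster
    return clusters


if __name__ == "__main__":
    # Hypothetical toy input: "b" is a prefix of "a"; "c" shares no residues.
    toy = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFS",
        "c": "GGGGGGGGGGPPPPPPPPPP",
    }
    print(greedy_cluster(toy))
```

In this sketch each sequence is compared against representatives only, never against every other sequence, which is what keeps the number of comparisons well below all-against-all. A two-dataset comparison in the spirit of cd-hit-2d could reuse `identity` by treating one dataset as the fixed set of representatives, though that variant is likewise an assumption rather than something the abstract specifies.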
