UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches
- PMID: 25398609
- PMCID: PMC4375400
- DOI: 10.1093/bioinformatics/btu739
Abstract
Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters.
Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
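The search-then-expand step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `expand_hits` and `recall`, the cluster identifiers, and the toy accessions are all hypothetical, and the sketch assumes the BLASTP hits against UniRef50 and a cluster-to-members mapping have already been loaded into memory.

```python
from typing import Dict, Iterable, List


def expand_hits(representative_hits: Iterable[str],
                cluster_members: Dict[str, List[str]]) -> List[str]:
    """Replace each cluster-level hit with all members of that cluster.

    Hit order is preserved and duplicates are dropped. A hit with no
    entry in the mapping is kept unchanged (an assumption of this sketch).
    """
    expanded: List[str] = []
    seen = set()
    for rep in representative_hits:
        for member in cluster_members.get(rep, [rep]):
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


def recall(expanded_hits: Iterable[str], reference_hits: Iterable[str]) -> float:
    """Fraction of reference hits (e.g. from a direct UniProtKB search)
    that are recovered in the expanded hit list."""
    reference = set(reference_hits)
    if not reference:
        return 1.0  # nothing to recover
    return len(reference & set(expanded_hits)) / len(reference)


# Toy data: hypothetical cluster IDs and member accessions.
clusters = {
    "UniRef50_P1": ["P1", "P2", "P3"],
    "UniRef50_Q1": ["Q1"],
}
uniref_hits = ["UniRef50_P1", "UniRef50_Q1"]   # short hit list before expansion
uniprotkb_hits = ["P1", "P2", "Q1", "R9"]      # hits from a direct search

expanded = expand_hits(uniref_hits, clusters)
print(expanded)
print(recall(expanded, uniprotkb_hits))
```

In this toy case the two cluster-level hits expand to four member sequences, and three of the four direct-search hits are recovered. The figures reported in the abstract come from the authors' own evaluation, not from this sketch.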
© The Author 2014. Published by Oxford University Press.
