CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures
- PMID: 18052539
- PMCID: PMC2098860
- DOI: 10.1371/journal.pcbi.0030232
Abstract
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure-based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
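The iterative protocol described in the abstract (locate a known fold, score the match, excise the verified domain, repeat) can be illustrated with a minimal Python sketch. This is not the CATHEDRAL implementation: the names `DomainHit`, `assign_domains`, `find_best_hit`, `score_threshold`, and `min_domain_size` are hypothetical, and the search step (graph-based fold location, double-dynamic programming alignment, and SVM scoring) is abstracted into a caller-supplied function. The helper `boundaries_within_tolerance` shows one possible reading of "within a tolerance of ten residues", assuming predicted and reference boundaries are compared pairwise.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class DomainHit:
    """Hypothetical record for one candidate domain assignment."""
    fold_id: str
    residues: Set[int]   # residue numbers covered by the matched domain
    score: float         # stand-in for a combined (e.g. SVM-derived) score


def assign_domains(
    chain_residues: Iterable[int],
    find_best_hit: Callable[[Set[int]], Optional[DomainHit]],
    score_threshold: float,
    min_domain_size: int = 40,  # assumed cutoff, not taken from the paper
) -> List[DomainHit]:
    """Repeatedly find, accept, and excise domains until none remain."""
    remaining = set(chain_residues)
    assigned: List[DomainHit] = []
    while len(remaining) >= min_domain_size:
        hit = find_best_hit(remaining)
        if hit is None or hit.score < score_threshold:
            break  # no recognisable domain left
        covered = hit.residues & remaining
        if not covered:
            break  # guard against a search that makes no progress
        assigned.append(DomainHit(hit.fold_id, covered, hit.score))
        remaining -= covered  # excise the verified domain and search again
    return assigned


def boundaries_within_tolerance(
    predicted: List[int], reference: List[int], tolerance: int = 10
) -> bool:
    """True if every predicted boundary lies within `tolerance` residues
    of the corresponding reference boundary (assumed pairwise comparison)."""
    if len(predicted) != len(reference):
        return False
    return all(
        abs(p - r) <= tolerance
        for p, r in zip(sorted(predicted), sorted(reference))
    )


def toy_search(remaining: Set[int]) -> Optional[DomainHit]:
    """Toy stand-in for the fold search: a fixed two-entry 'library'."""
    library = [
        DomainHit("fold_A", set(range(1, 101)), 0.9),
        DomainHit("fold_B", set(range(101, 221)), 0.8),
    ]
    candidates = [h for h in library if h.residues <= remaining]
    return max(candidates, key=lambda h: h.score, default=None)


domains = assign_domains(range(1, 251), toy_search, score_threshold=0.5)
print([d.fold_id for d in domains])  # ['fold_A', 'fold_B']
```

In this toy run the 250-residue chain yields two domains; the 30 residues left over fall below the assumed minimum size, so the loop stops without a third search.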
