Comparative Study
Nucleic Acids Res. 2009 Feb;37(2):452-62.
doi: 10.1093/nar/gkn944. Epub 2008 Dec 4.

FIEFDom: A Transparent Domain Boundary Recognition System Using a Fuzzy Mean Operator

Free PMC article
Rajkumar Bondugula et al. Nucleic Acids Res.

Abstract

Protein domain prediction is often the preliminary step in both experimental and computational protein research. Here we present a new method to predict the domain boundaries of a multidomain protein from its amino acid sequence using a fuzzy mean operator. Using the nr-sequence database together with a reference protein set (RPS) containing known domain boundaries, the operator assigns each residue of the query sequence a likelihood of belonging to a domain boundary. This procedure robustly identifies contiguous boundary regions. For a dataset with a maximum sequence identity of 30%, the average domain prediction accuracy of our method is 97% for one-domain proteins and 58% for multidomain proteins. The presented model can use new sequence/structure information without re-parameterization after each RPS update. When tested on a current database using a four-year-old RPS, and on a database that contains different domain definitions than those used to train the models, our method consistently yielded the same accuracy, whereas two other published methods did not. A comparison with other domain prediction methods used in the CASP7 competition indicates that our method performs better than existing sequence-based methods.

Figures

Figure 1.
The fragments retrieved when the RPS is searched for fragments matching a typical protein. The fragments shown are labeled using their SCOP definitions. Residues labeled ‘D’ lie in protein domains, whereas residues labeled ‘B’ lie on the domain boundary; ‘–’ indicates that no residue in the current fragment is aligned with the query sequence. For the alanine residue (A) in the shaded box, the domain boundary propensity is calculated using Equation 2 based on the five aligned residues (K = 5), four of which are found in non-boundary regions and one of which is found in a boundary region. The importance of these contributions is inversely weighted by their respective scores, S, shown on the right, as detailed in Equation 2. In this case, the likelihood PB that the alanine residue belongs to a domain boundary is 0.0804.
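The weighting described in the Figure 1 caption can be illustrated with a minimal sketch. Equation 2 itself is not reproduced on this page, so the exact form used here (weights taken as 1/S and normalized over the K aligned residues) is an assumption, as are the function name `boundary_propensity` and the example scores. The sketch is not expected to reproduce the value 0.0804 quoted in the caption.

```python
def boundary_propensity(labels, scores):
    """Inverse-score weighted mean of boundary membership for one query residue.

    labels: one 'B' (boundary) or 'D' (domain) label per aligned fragment residue.
    scores: one positive score S per fragment; a smaller S gives a larger weight.
    NOTE: the 1/S weighting is an assumed stand-in for the paper's Equation 2.
    """
    if not labels or len(labels) != len(scores):
        raise ValueError("labels and scores must be non-empty and of equal length")
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive")
    weights = [1.0 / s for s in scores]
    boundary_weight = sum(w for lab, w in zip(labels, weights) if lab == "B")
    return boundary_weight / sum(weights)


# Hypothetical example with K = 5: four domain residues, one boundary residue.
example_labels = ["D", "D", "D", "D", "B"]
example_scores = [1.0, 2.0, 4.0, 4.0, 4.0]  # made-up scores, not from the paper
example_pb = boundary_propensity(example_labels, example_scores)
```

With this form, a boundary-labeled fragment with a poor (large) score contributes little, so a single weak boundary match among several strong domain matches yields a low propensity.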
Figure 2.
The predicted raw domain boundary propensity (solid line) of the Escherichia coli MurF enzyme, PDB code 1GG4, chain A. Two regions that potentially contain domain boundaries are identified. The post-processing results in two predicted boundaries centered on residues 91 and 314 (dotted lines), whereas the true boundaries are centered on residues 98 and 313 (data not shown). The background noise that gets filtered out during the post-processing can be seen at the COOH- and NH2-terminal ends of the sequence.
Figure 3.
The effect of the threshold on the performance of FIEFDom for the SCOP 1.73 (30%) dataset. (a) Receiver operating characteristic (ROC) curve averaged over all of the domain sets, plotted as the threshold (T) is varied from 0 to 1 in intervals of 0.1. (b) One-domain (blue solid line), two-domain (pink dashed line), three-domain (black dotted line), four-domain (red dashed-dotted line) and the average domain boundary prediction accuracy are plotted as a function of the threshold value, T. Because the accuracy values peak and vary slowly over a range of T values, we selected T = 0.4 as the appropriate value to be used in our model.
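How a threshold T turns a raw propensity profile into predicted boundary positions can be sketched as follows. The paper's actual post-processing is not described on this page, so this is a simplified stand-in: it finds contiguous runs of residues whose propensity is at least T and reports the center of each run. The function name `boundary_centers` and the `min_len` noise filter are hypothetical; only the default T = 0.4 comes from the caption.

```python
def boundary_centers(propensity, threshold=0.4, min_len=1):
    """Return 1-based center positions of contiguous runs with propensity >= threshold.

    propensity: per-residue boundary likelihoods in [0, 1].
    min_len: hypothetical noise filter; runs shorter than this are dropped.
    """
    centers = []
    start = None
    # A sentinel below any threshold closes a run that reaches the last residue.
    for i, p in enumerate(list(propensity) + [float("-inf")]):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            end = i - 1
            if end - start + 1 >= min_len:
                centers.append((start + end) // 2 + 1)  # convert to 1-based
            start = None
    return centers


# Made-up profile with one broad peak and one isolated spike.
profile = [0.0, 0.1, 0.5, 0.6, 0.5, 0.1, 0.0, 0.45, 0.0]
all_centers = boundary_centers(profile)
filtered_centers = boundary_centers(profile, min_len=2)
```

Raising `min_len` discards the isolated spike while keeping the broad peak, which is one simple way to suppress background noise of the kind visible near the sequence termini.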
Figure 4.
(a) One-domain (red dashed line), two-domain (blue dashed-dotted line), three-domain (green dotted line), four-domain (solid magenta line) and average (bold solid black line) domain prediction accuracies are plotted as a function of database version. As time progresses, new information can be added to the prediction algorithm by updating the RPS. As the number of sequences in the database increases, the prediction accuracy improves. (b) The same domain prediction accuracies as in (a) are plotted as a function of maximum sequence identity cutoff in the RPS. More structural information is added to the prediction system by increasing the maximum sequence identity among proteins in the RPS.
