Bioinformatics. 2009 May 15;25(10):1264-70.
doi: 10.1093/bioinformatics/btp149. Epub 2009 Mar 16.

Using Multi-Data Hidden Markov Models Trained on Local Neighborhoods of Protein Structure to Predict Residue-Residue Contacts

Free PMC article

Patrik Björkholm et al. Bioinformatics.

Abstract

Motivation: Correct prediction of residue-residue contacts in proteins that lack good templates with known structure would take ab initio protein structure prediction a large step forward. The lack of correct contacts, and in particular long-range contacts, is considered the main reason why these methods often fail.

Results: We propose a novel hidden Markov model (HMM)-based method for predicting residue-residue contacts from protein sequences using as training data homologous sequences, predicted secondary structure and a library of local neighborhoods (local descriptors of protein structure). The library consists of recurring structural entities incorporating short-, medium- and long-range interactions and is general enough to reassemble the cores of nearly all proteins in the PDB. The method is tested on an external test set of 606 domains with no significant sequence similarity to the training set as well as 151 domains with SCOP folds not present in the training set. Considering the top 0.2 x L predictions (L = sequence length), our HMMs obtained an accuracy of 22.8% for long-range interactions in new fold targets, and an average accuracy of 28.6% for long-, medium- and short-range contacts. This is a significant performance increase over currently available methods when comparing against results published in the literature.
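The "top 0.2 x L" accuracy quoted above can be illustrated with a minimal sketch. It assumes predictions arrive as scored residue pairs and that a contact range is defined by sequence separation |i - j|; the abstract does not state the separation thresholds, so `min_sep` and `max_sep` are hypothetical parameters the caller must supply, and the function name is illustrative rather than part of the published method.

```python
from typing import Iterable, Optional, Tuple


def top_fraction_accuracy(
    predictions: Iterable[Tuple[int, int, float]],
    true_contacts: Iterable[Tuple[int, int]],
    seq_len: int,
    min_sep: int,
    max_sep: Optional[int] = None,
    fraction: float = 0.2,
) -> float:
    """Fraction of the top `fraction * seq_len` predictions that are true contacts.

    predictions: (i, j, score) triples, higher score = more confident.
    min_sep / max_sep: assumed sequence-separation bounds for the contact
    range being evaluated (not specified in the abstract).
    """
    # Keep only predictions whose sequence separation falls in the chosen range.
    in_range = [
        (i, j, s)
        for i, j, s in predictions
        if abs(i - j) >= min_sep and (max_sep is None or abs(i - j) <= max_sep)
    ]
    # Rank by confidence and keep the top fraction * L pairs (at least one).
    in_range.sort(key=lambda p: p[2], reverse=True)
    n_top = max(1, int(fraction * seq_len))
    top = in_range[:n_top]
    if not top:
        return 0.0
    # Contacts are unordered pairs, so normalise to (low, high) before comparing.
    truth = {(min(i, j), max(i, j)) for i, j in true_contacts}
    hits = sum((min(i, j), max(i, j)) in truth for i, j, _ in top)
    return hits / len(top)
```

Calling the function once per range (with the appropriate separation bounds) and averaging over targets would give per-range figures comparable in form to those reported, under the stated assumptions.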

Availability: http://predictioncenter.org/Services/FragHMMent/.

Figures

Fig. 1.
The local descriptor denoted 1gr8a_#407 (i.e. the local neighborhood around amino acid number 407 in protein domain 1gr8a_). The left figure shows the local descriptor 1gr8a_#407 (red) in the structure domain 1gr8a_, while the middle figure shows a close up of the same local descriptor. It consists of three fragments that are in proximity to each other in space but not along the amino acid sequence. The right figure shows the structural alignment of similar local descriptors in other ASTRAL domains. The corresponding sequence alignment is also shown.
Fig. 2.
The topology of the HMMs used to align structural neighborhoods to a target sequence. The underlying red structure represents states emitting secondary structure (labeled ‘ss’) and the blue overlaying structure represents states emitting amino acids (labeled ‘aa’). The arrows represent allowed transitions between the three different states: matches (M), insertions (I) and deletions (D). Although the model has two layers with separate arrows representing transitions, the model is always in the same state in both layers (corresponding to the same position in the multiple alignment, see Fig. 1) and the transition probabilities from these two states are the same. Thus, each such pair of states (e.g. M12 aa and M12 ss) has one transition probability and two emission probabilities.
Fig. 3.
Method overview.
Fig. 4.
The average percentage of descriptor segments in a group that have been aligned within a certain residue distance of their true positions.
Fig. 5.
The three curves show the distribution of short-, medium- and long-range contacts as a function of sequence length. The blue fields represent the density of Pct for the specific amino acid sequence length (x-axis). The red line is a spline curve extracted from the different range distributions.
Fig. 6.
Contact prediction accuracy (Pct = 0.2) for new fold targets plotted against the fraction of the targets structurally matched by at least one local descriptor group in the library (i.e. descriptor coverage). Correlation coefficients between prediction accuracy and descriptor coverage are low and equal to 0.11, 0.03, 0.06 and 0.09 for long-, medium-, short- and all-range contacts, respectively (see text).
Fig. 7.
(A) The assignment of group 1gr8a_#407 (Fig. 1) to the recombinational repair protein RecR (PDB code 1vdd, chain A). Positions 1–54 in the structure are not shown. (B) Contacts correctly predicted by the group. Medium-range contacts are in blue and long-range contacts in green.
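
The two-layer state described in the Fig. 2 caption (one transition probability and two emission probabilities per state pair) can be sketched as a small data structure. This is an illustration only: the class name `PairedState`, the example probabilities and, in particular, the choice to combine the amino acid and secondary structure emissions as a product (i.e. treating the two layers as independent given the state) are assumptions, not details confirmed by the abstract or captions.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class PairedState:
    """One HMM position seen in both layers, e.g. M12 aa together with M12 ss."""

    aa_emission: Dict[str, float]   # amino acid -> emission probability
    ss_emission: Dict[str, float]   # secondary structure symbol -> probability
    transitions: Dict[str, float]   # next state name -> shared transition probability

    def log_emission(self, aa: str, ss: str) -> float:
        # Assumption: the joint emission is the product of the two layers,
        # so the log-probabilities add.
        return math.log(self.aa_emission[aa]) + math.log(self.ss_emission[ss])

    def log_step(self, next_state: str, aa: str, ss: str) -> float:
        # Emit (aa, ss) from this state, then move to next_state using the
        # single transition probability shared by both layers.
        return self.log_emission(aa, ss) + math.log(self.transitions[next_state])


# Hypothetical numbers for a match state; not taken from the paper.
m12 = PairedState(
    aa_emission={"A": 0.5, "G": 0.5},
    ss_emission={"H": 0.8, "C": 0.2},
    transitions={"M13": 0.9, "I12": 0.05, "D13": 0.05},
)
```

Because both layers share one set of transitions, a path through the model is scored once for its transitions and twice for its emissions at each emitting state, which is what `log_step` expresses.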
