PLoS Comput Biol. 2013;9(8):e1003176.
doi: 10.1371/journal.pcbi.1003176. Epub 2013 Aug 22.

From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction

Simona Cocco et al. PLoS Comput Biol. 2013.

Abstract

Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
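
As a rough illustration of the spectral starting point described in the abstract, the sketch below computes the eigenvalues and eigenmodes of a Pearson correlation matrix and separates the low-eigenvalue modes (discarded by PCA) from the high-eigenvalue ones. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `correlation_spectrum` and `split_modes`, the diagonal `pseudocount` regularization, and the random binary toy data standing in for an encoded alignment are all hypothetical.

```python
import numpy as np

def correlation_spectrum(X, pseudocount=1e-3):
    """Eigen-decompose the Pearson correlation matrix of the columns of X.

    X is an (n_sequences, n_features) array, e.g. a numerically encoded
    alignment. The diagonal pseudocount is an assumption of this sketch;
    it only guards against zero-variance columns.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column
    cov = Xc.T @ Xc / X.shape[0]                 # empirical covariance
    cov += pseudocount * np.eye(cov.shape[0])    # simple regularization
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)              # Pearson correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    return eigvals, eigvecs

def split_modes(eigvecs, n_low, n_high):
    """Return the n_low lowest-eigenvalue and n_high highest-eigenvalue modes."""
    n = eigvecs.shape[1]
    low = eigvecs[:, :n_low]                     # modes that PCA would discard
    high = eigvecs[:, n - n_high:]               # principal components
    return low, high

# Toy data (hypothetical): 200 "sequences" with 12 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 12))
eigvals, eigvecs = correlation_spectrum(X)
low, high = split_modes(eigvecs, n_low=3, n_high=2)
```

Because a correlation matrix has unit diagonal, its eigenvalues sum to the number of features, so eigenvalues far above 1 must be balanced by eigenvalues below 1; both tails carry information about the correlation structure.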

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Pattern selection by maximum likelihood and pattern prefactors.
(Left panel) Contribution of patterns to the log-likelihood (full red line) as a function of the corresponding eigenvalues formula image of the Pearson correlation matrix formula image. To select formula image patterns, a log-likelihood threshold formula image (dashed black line) has to be chosen such that there are exactly formula image patterns with formula image. The selected patterns correspond to eigenvalues in the left and right tails of the spectrum of formula image. (Right panel) Pattern prefactors formula image (full red line) as a function of the eigenvalue formula image. Patterns corresponding to formula image have essentially vanishing prefactors; patterns associated with large formula image (formula image) have prefactors smaller than 1 (dashed black line), while patterns corresponding to small formula image (formula image) have unbounded prefactors.
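
The selection rule in this caption can be sketched as follows. The formulas are not reproduced in this excerpt, so the functional forms below are assumptions chosen only because they match the behaviour the caption describes: a log-likelihood gain of (λ − 1 − ln λ)/2, which vanishes at λ = 1 and grows in both tails, and a prefactor sqrt(|1 − 1/λ|), which vanishes at λ = 1, stays below 1 for large λ, and is unbounded for small λ. The function names are hypothetical.

```python
import numpy as np

def log_likelihood_gain(eigvals):
    """Assumed per-pattern gain: zero at lambda = 1, positive in both tails."""
    lam = np.asarray(eigvals, dtype=float)
    return 0.5 * (lam - 1.0 - np.log(lam))

def pattern_prefactor(eigvals):
    """Assumed prefactor: 0 at lambda = 1, < 1 for lambda > 1, unbounded as lambda -> 0."""
    lam = np.asarray(eigvals, dtype=float)
    return np.sqrt(np.abs(1.0 - 1.0 / lam))

def select_patterns(eigvals, p):
    """Indices of the p patterns with the largest assumed log-likelihood gain.

    This is equivalent to choosing a threshold such that exactly p patterns
    lie above it (ties aside).
    """
    gains = log_likelihood_gain(eigvals)
    order = np.argsort(gains)[::-1]              # largest gain first
    return np.sort(order[:p])

# Hypothetical spectrum: two small, one near 1, one slightly above 1, one large.
spectrum = np.array([0.05, 0.5, 1.0, 1.2, 6.0])
selected = select_patterns(spectrum, p=2)        # picks one pattern from each tail
```

With this toy spectrum, the two selected patterns are the smallest and the largest eigenvalue, which illustrates why a likelihood-based criterion keeps modes from both ends of the spectrum rather than only the top ones.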
Figure 2. Eigenvalues, localization and contributions to couplings for PF00014.
(Top panel) Spectral density as a function of the eigenvalues formula image. Note the existence of a few very large eigenvalues and a pronounced peak at formula image. (Middle panel) Inverse participation ratio (IPR) of the Hopfield patterns as a function of the corresponding eigenvalue formula image. A large IPR indicates that a pattern is concentrated on few positions and amino acids. (Bottom panel) Typical contribution formula image to couplings due to each Hopfield pattern, defined in Eq. (26), as a function of the corresponding eigenvalue formula image. Large contributions are mostly found for small eigenvalues, while patterns corresponding to formula image do not contribute to couplings.
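
The localization measure used in the middle panel can be illustrated with the standard definition of the inverse participation ratio: the sum of fourth powers of the components of a unit-normalized vector. This is a minimal sketch; whether the paper uses exactly this normalization is an assumption, and the function name is hypothetical.

```python
import numpy as np

def inverse_participation_ratio(pattern):
    """IPR of a vector after normalizing it to unit Euclidean norm.

    Ranges from 1/n (fully extended over n components) to 1 (one component).
    """
    v = np.asarray(pattern, dtype=float).ravel()
    v = v / np.linalg.norm(v)
    return float(np.sum(v ** 4))

extended = np.ones(100)                          # spread evenly over 100 entries
pair = np.zeros(100)
pair[[3, 40]] = 1.0                              # concentrated on two entries

ipr_extended = inverse_participation_ratio(extended)
ipr_pair = inverse_participation_ratio(pair)
```

Under this definition, a pattern spread evenly over 100 components has an IPR of 0.01, while one concentrated equally on two components has an IPR of 0.5, so 1/IPR can be read as an effective number of participating components.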
Figure 3. Attractive and repulsive patterns for PF00014.
(Upper panels) The most localized repulsive patterns (corresponding to the first, third and fourth smallest eigenvalues, with inverse participation ratios formula image respectively) are strongly concentrated on pairs of positions. (Lower panels) The most attractive patterns (corresponding to the three largest eigenvalues). The top pattern is extended, with inverse participation ratio formula image, while the second and third patterns, with inverse participation ratios formula image respectively, have non-zero components essentially only on the gap symbols, which accumulate at the edges of the sequence. Note the formula image-coordinates formula image: the integer part is the site index, formula image, and the fractional part multiplied by formula image is the residue value, formula image.
Figure 4. The principal component and predicted contacts visualized on the 3D structure of the trypsin inhibitor protein domain PF00014.
(A) The 10 positions (residue IDs 5, 12, 14, 22, 23, 30, 35, 40, 51, 55) with the largest entries in the most attractive Hopfield pattern (largest eigenvalue of formula image, corresponding to the principal component) are shown in blue; they also correspond to highly conserved sites. Note that, while they are distant along the protein backbone, they cluster into spatially connected components in the folded protein. (B) The 50 residue pairs with the strongest couplings (ranked according to the Frobenius norms, Eq. (40), with a separation of at least 5 positions along the backbone) are connected by lines. Only two of these pairs are not in contact (blue links); the other 48 are true-positive contact predictions (red links). Many contacts link pairs of non-conserved positions. Note that links are drawn between C-alpha atoms, whereas contacts are defined via minimal all-atom distances, so some red lines appear rather long even though they correspond to native contacts.
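
The ranking step described in panel (B) can be sketched as below: each residue pair is scored by the Frobenius norm of its coupling block, and only pairs separated by at least 5 positions along the backbone are ranked. This is a minimal sketch under stated assumptions. The layout of the coupling array `J` with shape (N, N, q, q) and the function names are hypothetical, and any gauge transformation or score correction the paper applies in its Eq. (40) is omitted here.

```python
import numpy as np

def frobenius_scores(J):
    """Frobenius norm of each (q x q) coupling block J[i, j]."""
    J = np.asarray(J, dtype=float)
    return np.sqrt((J ** 2).sum(axis=(2, 3)))

def top_pairs(scores, n_pairs, min_separation=5):
    """Highest-scoring pairs (i, j) with j - i >= min_separation."""
    n = scores.shape[0]
    i, j = np.triu_indices(n, k=min_separation)  # upper triangle, offset k
    order = np.argsort(scores[i, j])[::-1][:n_pairs]
    return [(int(i[k]), int(j[k])) for k in order]

# Toy couplings (hypothetical): 8 sites, 2 states per site.
J = np.zeros((8, 8, 2, 2))
J[0, 6] = 1.0        # distant pair, block norm 2.0
J[2, 7] = 0.5        # distant pair, block norm 1.0
J[1, 2] = 10.0       # strong but adjacent pair, excluded by the separation filter

scores = frobenius_scores(J)
ranked = top_pairs(scores, n_pairs=2)
```

The separation filter matters because neighbouring positions along the chain are trivially close in space; excluding them keeps the ranking focused on contacts that are informative about the fold.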
Figure 5. Contact map for the PF00014 family.
Filled squares represent the native contact map on the 3D fold (PDB 5pti, with turquoise squares signaling all-atom distances below 5 Å, and grey ones distances between 5 Å and 8 Å). The 50 top predicted contacts with minimal separation of 5 positions along the sequence (formula image) are shown with empty squares: true-positive predictions (distance formula image Å) are colored in red, and false-positive predictions in blue. Predictions are made with the Hopfield-Potts model with formula image patterns (bottom right corner) and with formula image patterns (DCA, top left corner). For both values of formula image there are 48 true-positive and 2 false-positive predictions.
Figure 6. Contact predictions for the three considered protein families.
The upper panels show the fraction of the interaction-based contribution to the log-likelihood of the model given the MSA, defined as the ratio of the log-likelihood with formula image selected patterns over the maximal log-likelihood obtained by including all formula image patterns, as a function of the number formula image of selected patterns. The fraction reaches one for formula image, corresponding to the Potts model used in DCA. The lower panels show the TP rates as a function of the number of predicted residue contacts, for various numbers formula image of selected patterns, where selection was done using the maximum-likelihood criterion. The case formula image gives the contact predictions obtained by the DCA approach. Only non-trivial contacts between sites formula image such that formula image are considered in the calculation of the TP rate.
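
The TP-rate curves in the lower panels can be illustrated with a short sketch: given a ranked list of predicted pairs and a matrix of native distances, the TP rate after n predictions is the fraction of the first n pairs that are native contacts. The 8 Å default cutoff is an assumption of this sketch (the threshold itself is not legible in this excerpt), and the function name and toy distances are hypothetical.

```python
import numpy as np

def tp_rate_curve(ranked_pairs, distances, cutoff=8.0):
    """Cumulative fraction of true positives among the first n predictions.

    ranked_pairs: list of (i, j) pairs, best prediction first.
    distances:    (N, N) array of native inter-residue distances in angstroms.
    cutoff:       distance below which a pair counts as a contact (assumed).
    """
    hits = np.array([distances[i, j] < cutoff for i, j in ranked_pairs], dtype=float)
    return np.cumsum(hits) / np.arange(1, len(hits) + 1)

# Toy example (hypothetical distances): three ranked predictions.
distances = np.full((10, 10), 20.0)
distances[0, 5] = 4.0     # contact
distances[1, 7] = 12.0    # not a contact
distances[2, 8] = 6.0     # contact

curve = tp_rate_curve([(0, 5), (1, 7), (2, 8)], distances)
```

Here the first prediction is correct, the second is a false positive, and the third is correct again, so the curve drops to one half and then recovers to two thirds.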
Figure 7. Contact predictions across 15 protein families.
(Left panel) TP rates for the contact prediction with variable numbers formula image of Hopfield-Potts patterns, averaged over 15 distinct protein families. (Right panel) TP rates for the contact prediction using only the repulsive patterns (green line) or only the attractive patterns (red line) contained in the formula image most likely patterns (black line), averaged over 15 protein families. The contact prediction remains almost unchanged when only the subset of repulsive patterns is used, whereas it drops substantially when only the attractive patterns are kept.
Figure 8. Noise reduction due to pattern selection in reduced data sets.
(Full lines) TP rates of mean-field DCA for sub-MSAs of family PF00014 with formula image sequences; each curve is averaged over 200 randomly selected sub-alignments. Whereas for formula image and formula image the accuracy of the first predictions is close to one, mean-field DCA does not extract any reasonable signal for formula image and formula image. (Dashed lines) The same sub-MSAs are analyzed with the Hopfield-Potts model using formula image patterns (maximum-likelihood selection). Whereas this selection reduces the accuracy for formula image, it results in increased TP rates for formula image. Dimensional reduction by pattern selection has led to an efficient noise reduction.
