From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction
- PMID: 23990764
- PMCID: PMC3749948
- DOI: 10.1371/journal.pcbi.1003176
Abstract
Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
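The pattern selection described above can be illustrated with a small sketch. It rests on an assumption that is not stated in this abstract: each eigenmode of the correlation matrix with eigenvalue λ is taken to contribute (λ - 1 - ln λ)/2 to the log-likelihood, so that modes far from λ = 1 on either side score highest. The function names `loglik_gain` and `select_patterns` and the toy spectrum are hypothetical.

```python
import math

def loglik_gain(lam):
    """Assumed per-mode log-likelihood gain; zero at lam == 1.

    Requires lam > 0 (strictly positive eigenvalues).
    """
    return 0.5 * (lam - 1.0 - math.log(lam))

def select_patterns(eigenvalues, p):
    """Return the (sorted) indices of the p modes with the largest assumed gain."""
    ranked = sorted(range(len(eigenvalues)),
                    key=lambda k: loglik_gain(eigenvalues[k]),
                    reverse=True)
    return sorted(ranked[:p])

# Toy spectrum (made up for illustration): a bulk near 1 plus both tails.
spectrum = [0.05, 0.4, 0.9, 1.0, 1.1, 3.0, 12.0]
chosen = select_patterns(spectrum, 3)
print(chosen)  # indices of the selected modes
```

Under this assumed score, the selection keeps modes from both the low-eigenvalue and the high-eigenvalue tail, whereas a PCA-style selection would keep only the largest eigenvalues.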
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
[…] of the Pearson correlation matrix […]. To select […] patterns, a log-likelihood threshold […] (dashed black line) has to be chosen such that there are exactly […] patterns with […]. This corresponds to eigenvalues in the left and right tail of the spectrum of […]. (right panel) Pattern prefactors […] (full red line) as a function of the eigenvalue […]. Patterns corresponding to […] have essentially vanishing prefactors; patterns associated to large […] ([…]) have prefactors smaller than 1 (dashed black line), while patterns corresponding to small […] ([…]) have unbounded prefactors.
[…], note the existence of few very large eigenvalues, and a pronounced peak in […]. (middle panel) Inverse participation ratio of the Hopfield patterns as a function of the corresponding eigenvalue […]. Large IPR characterizes the concentration of a pattern to few positions and amino acids. (bottom panel) Typical contribution […] to couplings due to each Hopfield pattern, defined in Eq. (26), as a function of the corresponding eigenvalue […]. Large contributions are mostly found for small eigenvalues, while patterns corresponding to […] do not contribute to couplings.
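As a rough illustration of how an inverse participation ratio (IPR) separates localized from extended patterns, the sketch below uses the common definition IPR = Σ v⁴ / (Σ v²)². Whether the paper normalizes its patterns in exactly this way is an assumption, and the function name and toy vectors are hypothetical.

```python
def inverse_participation_ratio(pattern):
    """IPR of a vector: sum(v^4) / (sum(v^2))^2.

    Equals 1 for a pattern concentrated on a single component and
    1/N for a pattern spread uniformly over N components.
    """
    norm2 = sum(x * x for x in pattern)
    if norm2 == 0:
        raise ValueError("pattern must have at least one non-zero component")
    return sum(x ** 4 for x in pattern) / norm2 ** 2

extended = [0.5, 0.5, 0.5, 0.5]    # spread over 4 components
localized = [0.0, 0.0, 1.0, 0.0]   # concentrated on 1 component
pair = [1.0, -1.0, 0.0, 0.0]       # concentrated on 2 components

print(inverse_participation_ratio(extended))
print(inverse_participation_ratio(localized))
print(inverse_participation_ratio(pair))
```

With this definition, 1/IPR can be read as an effective number of components carrying the pattern.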
[…] respectively) are strongly concentrated in pairs of positions. (lower panels) The most attractive patterns (corresponding to the three largest eigenvalues); the top pattern is extended, with inverse participation ratio […], while the second and third patterns, with inverse participation ratios […] respectively, have essentially non-zero components over the gap symbols only, which accumulate on the edges of the sequence. Note the […]-coordinates […]; its integer part is the site index, […], and the fractional part multiplied by […] is the residue value, […].
[…], corresponding to the principal component) are shown in blue; they also correspond to very conserved sites. Note that, while they are distant along the protein backbone, they cluster into spatially connected components in the folded protein. (B) The 50 residue pairs with strongest couplings (ranked according to the Frobenius norms, Eq. (40)), with at least 5 positions of separation along the backbone, are connected by lines. Only two of these pairs are not in contact (blue links); the other 48 are thus true-positive contact predictions (red links). Many contacts link pairs of non-conserved positions. Note that links are drawn between C-alpha atoms, whereas contacts are defined via minimal all-atom distances, so some red lines appear rather long even though they correspond to native contacts.
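The ranking step mentioned in this caption can be sketched as follows. This is a minimal illustration, assuming couplings are available as one matrix per residue pair and that the score is the plain Frobenius norm of that matrix; Eq. (40) itself is not reproduced on this page, so any additional normalization used in the paper is not captured here. The names `rank_pairs`, `frobenius_norm` and the toy data are hypothetical.

```python
import math

def frobenius_norm(block):
    """Square root of the sum of squared entries of a coupling matrix."""
    return math.sqrt(sum(x * x for row in block for x in row))

def rank_pairs(couplings, min_separation=5, top=50):
    """Rank residue pairs (i, j) by Frobenius norm of their coupling block.

    Pairs closer than `min_separation` along the backbone are skipped.
    """
    scored = [((i, j), frobenius_norm(block))
              for (i, j), block in couplings.items()
              if abs(i - j) >= min_separation]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]

# Toy couplings with a 2-letter alphabet (made up for illustration).
toy = {
    (0, 2): [[5.0, 0.0], [0.0, 5.0]],  # strong, but too close along the chain
    (0, 7): [[3.0, 0.0], [0.0, 4.0]],  # norm 5
    (1, 9): [[1.0, 1.0], [1.0, 1.0]],  # norm 2
}
ranking = rank_pairs(toy)
print(ranking)
```

The separation filter removes pairs that are trivially close along the chain, so the top of the ranking reflects non-trivial contact candidates.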
[…]) are shown with empty squares: true-positive predictions (distance […] Å) are colored in red, and false-positive predictions in blue. Predictions are made with the Hopfield-Potts model with […] patterns (bottom right corner) and with […] patterns (DCA, top left corner). For both values of […] there are 48 true-positive and 2 false-positive predictions.
[…] selected patterns over the maximal log-likelihood obtained by including all […] patterns, as a function of the number […] of selected patterns; it reaches one for […], corresponding to the Potts model used in DCA. The lower panels show the TP rates as a function of the predicted residue contacts, for various numbers […] of selected patterns, where selection was done using the maximum-likelihood criterion. […] gives the contact predictions obtained by the DCA approach. Only non-trivial contacts between sites […] such that […] are considered in the calculation of the TP rate.
[…] of Hopfield-Potts patterns, averaged over 15 distinct protein families. (right panel) TP rates for the contact prediction using only the repulsive (green line) or only the attractive (red line) patterns contained in the […] most likely patterns (black line), averaged over 15 protein families. The contact prediction remains almost unchanged when only the subset of repulsive patterns is used, whereas it drops substantially when only attractive patterns are kept.
[…] sequences; each curve is averaged over 200 randomly selected sub-alignments. Whereas for […] and […] the accuracy of the first predictions is close to one, mean-field DCA does not extract any reasonable signal for […] and […]. (dashed lines) The same sub-MSAs are analyzed with the Hopfield-Potts model using […] patterns (maximum-likelihood selection). Whereas this selection reduces the accuracy for […], it results in increased TP rates for […]. Dimensional reduction by pattern selection has led to an efficient noise reduction.
Similar articles
- Improving residue-residue contact prediction via low-rank and sparse decomposition of residue correlation matrix. Biochem Biophys Res Commun. 2016 Mar 25;472(1):217-22. doi: 10.1016/j.bbrc.2016.01.188. Epub 2016 Feb 23. PMID: 26920058
- Distance matrix-based approach to protein structure prediction. J Struct Funct Genomics. 2009 Mar;10(1):67-81. doi: 10.1007/s10969-009-9062-2. Epub 2009 Feb 18. PMID: 19224393 Free PMC article.
- Protein contact prediction by integrating deep multiple sequence alignments, coevolution and machine learning. Proteins. 2018 Mar;86 Suppl 1(Suppl 1):84-96. doi: 10.1002/prot.25405. Epub 2017 Oct 31. PMID: 29047157 Free PMC article.
- Prediction of Structures and Interactions from Genome Information. Adv Exp Med Biol. 2018;1105:123-152. doi: 10.1007/978-981-13-2200-6_9. PMID: 30617827 Review.
- Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models. PLoS Comput Biol. 2015 Jul 30;11(7):e1004182. doi: 10.1371/journal.pcbi.1004182. eCollection 2015 Jul. PMID: 26225866 Free PMC article. Review.
Cited by
- Mi3-GPU: MCMC-based Inverse Ising Inference on GPUs for protein covariation analysis. Comput Phys Commun. 2021 Mar;260:107312. doi: 10.1016/j.cpc.2020.107312. Epub 2020 Apr 17. PMID: 33716309 Free PMC article.
- Pareto Optimization of Combinatorial Mutagenesis Libraries. IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1143-1153. doi: 10.1109/TCBB.2018.2858794. Epub 2018 Jul 23. PMID: 30040654 Free PMC article.
- Mechanical couplings of protein backbone and side chains exhibit scale-free network properties and specific hotspots for function. Comput Struct Biotechnol J. 2021 Sep 8;19:5309-5320. doi: 10.1016/j.csbj.2021.09.004. eCollection 2021. PMID: 34765086 Free PMC article.
- Deep Analysis of Residue Constraints (DARC): identifying determinants of protein functional specificity. Sci Rep. 2020 Feb 3;10(1):1691. doi: 10.1038/s41598-019-55118-6. PMID: 32015389 Free PMC article.
- Improving contact prediction along three dimensions. PLoS Comput Biol. 2014 Oct 9;10(10):e1003847. doi: 10.1371/journal.pcbi.1003847. eCollection 2014 Oct. PMID: 25299132 Free PMC article.
