Nat Commun. 2021 May 5;12(1):2535.
doi: 10.1038/s41467-021-22869-8.

CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction

Fusong Ju et al. Nat Commun.

Abstract

Residue co-evolution has become the primary principle for estimating inter-residue distances of a protein, which are crucially important for predicting protein structure. Most existing approaches adopt an indirect strategy, i.e., inferring residue co-evolution from hand-crafted features, such as a covariance matrix, calculated from the multiple sequence alignment (MSA) of the target protein. This indirect strategy, however, cannot fully exploit the information carried by the MSA. Here, we report an end-to-end deep neural network, CopulaNet, to estimate residue co-evolution directly from the MSA. The key elements of CopulaNet include: (i) an encoder to model context-specific mutation for each residue; (ii) an aggregator to model residue co-evolution and thereafter estimate inter-residue distances. Using CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrate that CopulaNet can predict protein structure with improved accuracy and efficiency. This study represents a step toward improved end-to-end prediction of inter-residue distances and protein tertiary structures.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The limitation of the covariance-based methods in estimating inter-residue distances.
a Two artifactual proteins P1 and P2. In protein P1, two residues R1 and R2 are close, whereas in protein P2, they are far from each other. b The MSAs constructed for the two proteins show considerable differences. c The covariance matrices calculated from these two MSAs are identical; thus, covariance-based methods give the same estimate of inter-residue distances for proteins P1 and P2, which contradicts the true inter-residue distances. d Unlike the covariance matrices, the conditional joint-residue distribution P(R1, R2 | R3) can effectively distinguish these two MSAs.
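The point of Fig. 1 can be reproduced with a small toy example. The sketch below is not the data from the figure: the two three-column MSAs, the two-letter alphabet and the helper names `encode`, `covariance` and `conditional_joint` are all assumptions chosen for illustration. It shows two alignments whose covariance matrices coincide while their conditional joint distributions of columns 1 and 2, given column 3, differ.

```python
from collections import Counter
from itertools import product

import numpy as np

# Hypothetical toy MSAs over a two-letter alphabet (not the MSAs of Fig. 1).
msa_1 = ["AAA", "AGG", "GAG", "GGA"]  # column 3 is the XOR of columns 1 and 2
msa_2 = ["".join(p) for p in product("AG", repeat=3)]  # all 8 combinations


def encode(msa):
    """Encode 'A' as 0 and 'G' as 1, one row per sequence."""
    return np.array([[0 if c == "A" else 1 for c in seq] for seq in msa], dtype=float)


def covariance(msa):
    """Population covariance matrix between alignment columns."""
    return np.cov(encode(msa), rowvar=False, bias=True)


def conditional_joint(msa, r3):
    """Empirical P(R1, R2 | R3 = r3), using columns 1, 2 and 3 of the MSA."""
    rows = [seq for seq in msa if seq[2] == r3]
    counts = Counter(seq[:2] for seq in rows)
    return {pair: n / len(rows) for pair, n in sorted(counts.items())}


print(np.allclose(covariance(msa_1), covariance(msa_2)))
print(conditional_joint(msa_1, "A"))
print(conditional_joint(msa_2, "A"))
```

In this toy case every pairwise covariance is zero in both alignments, so a covariance-based feature cannot tell them apart, whereas conditioning on the third column reveals that columns 1 and 2 are perfectly coupled in the first alignment and independent in the second.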
Fig. 2. Predicting protein tertiary structure using ProFOLD.
Here, we use the CASP13 target protein T0992-D1 as an example to describe the main steps of ProFOLD. Only the first 13 residues are shown for clarity. First, we search this protein against sequence databases to identify its homologous proteins (2,807 proteins in total). Next, we use the acquired homologous proteins to construct an MSA for this protein. Then we apply CopulaNet to infer residue co-evolution directly from the MSA. CopulaNet uses an MSA encoder to model the mutation information for each residue of the target protein, and then uses a co-evolution aggregator to measure the residue co-mutations. From the acquired residue co-evolution information, the distance estimator estimates inter-residue distances. Finally, we transform the estimated distance distributions into a potential function, and then search for the structural conformation with the minimal potential. ProFOLD reports the structural conformation with sufficiently low potential as the final prediction result (TMscore: 0.84).
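The last step, turning predicted distance distributions into a potential, can be sketched as follows. This is a minimal illustration under stated assumptions, not the potential used by ProFOLD: it assumes the potential is the summed negative log-probability of the distance bin that each residue pair falls into, and the function name `distance_potential`, the bin edges and the toy coordinates are hypothetical.

```python
import numpy as np


def distance_potential(coords, dist_probs, bin_edges, eps=1e-8):
    """Sum of -log p_ij(bin containing d_ij) over residue pairs i < j.

    coords:     (L, 3) array of one representative atom per residue.
    dist_probs: (L, L, B) predicted probability of each distance bin.
    bin_edges:  (B + 1,) increasing bin boundaries; distances beyond the
                last edge are assigned to the last bin (an assumption).
    """
    n = coords.shape[0]
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # np.digitize returns 1-based bin indices; shift and clip to 0..B-1.
    bins = np.clip(np.digitize(dists, bin_edges) - 1, 0, dist_probs.shape[-1] - 1)
    i, j = np.triu_indices(n, k=1)
    return float(-np.log(dist_probs[i, j, bins[i, j]] + eps).sum())


# Toy prediction: every pair is most likely to lie in the first bin [0, 4).
bin_edges = np.array([0.0, 4.0, 8.0, 12.0, 16.0])
dist_probs = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (3, 3, 1))

compact = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
extended = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [20.0, 0.0, 0.0]])

print(distance_potential(compact, dist_probs, bin_edges))
print(distance_potential(extended, dist_probs, bin_edges))
```

A conformation that agrees with the predicted distributions receives a lower potential, so a structure search would prefer the compact toy conformation over the extended one here.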
Fig. 3. Precision of the predicted inter-residue contacts.
Here, the most probable L/5, L/2 and L long-range residue contacts are shown, where L represents protein length. The phrase "long-range" refers to two residues with sequence separation over 24 residues. For all CASP13 target proteins, ProFOLD outperformed the state-of-the-art approaches. In particular, for the 31 FM domains, ProFOLD achieved precisions of 0.840, 0.713 and 0.567 for the most probable L/5, L/2 and L contacts, which are significantly higher than those of AlphaFold, by 0.128, 0.117 and 0.097, respectively.
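The top-L/k precision reported in this caption can be computed roughly as in the sketch below. Several details are assumptions rather than statements from the caption: a contact is taken to be a pair whose true distance is below 8 Å, a pair counts as long-range when its sequence separation is at least `min_sep`, and the function name `top_contact_precision` is hypothetical.

```python
import numpy as np


def top_contact_precision(contact_probs, true_dists, k=1, min_sep=24, cutoff=8.0):
    """Precision of the L // k most probable long-range predicted contacts.

    contact_probs: (L, L) predicted contact probabilities.
    true_dists:    (L, L) ground-truth inter-residue distances.
    """
    length = contact_probs.shape[0]
    # Upper-triangle pairs with j - i >= min_sep (boundary convention assumed).
    i, j = np.triu_indices(length, k=min_sep)
    n_top = max(1, length // k)
    order = np.argsort(-contact_probs[i, j], kind="stable")[:n_top]
    hits = true_dists[i[order], j[order]] < cutoff
    return float(hits.mean())


# Toy example with L = 30: three correct and three incorrect confident predictions.
L = 30
probs = np.zeros((L, L))
dists = np.full((L, L), 20.0)
for a, b in [(0, 24), (0, 25), (0, 26)]:
    probs[a, b] = 0.9
    dists[a, b] = dists[b, a] = 5.0  # true contacts
for a, b in [(1, 25), (1, 26), (1, 27)]:
    probs[a, b] = 0.8  # predicted with confidence, but not in contact
probs[0, 1] = 1.0
dists[0, 1] = dists[1, 0] = 3.0  # short-range pair, excluded from the ranking

print(top_contact_precision(probs, dists, k=5))
print(top_contact_precision(probs, dists, k=10))
```

Calling the function with k = 5, 2 and 1 gives the L/5, L/2 and L precisions, respectively.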
Fig. 4. Precision of the predicted inter-residue contacts by the variant ProFOLD w/o R.
a For the 31 CASP13 FM targets, the precision increases with the receptive field size and finally reaches 0.382. b On the validation set with 1820 proteins, the precision also increases with the receptive field size and finally reaches 0.424. Even using the "encoder and aggregator" framework alone, the variant ProFOLD w/o R still outperformed CCMpred, which achieved 0.219 and 0.382 on the two datasets, respectively.
Fig. 5. Quality of the predicted tertiary structures for CASP13 FM target proteins.
a ProFOLD predicted more high-quality structures than the state-of-the-art approaches. When using the popular cut-off threshold for high-quality structures (TMscore ≥0.70), ProFOLD predicted high-quality structures for 18 out of the 31 domains, whereas AlphaFold and trRosetta predicted high-quality structures for only 12 and 7 domains, respectively. b Head-to-head comparison clearly demonstrates the advantages of ProFOLD over AlphaFold: for 24 out of the 31 FM domains, ProFOLD outperformed AlphaFold.
Fig. 6. Investigation of possible factors that might affect the performance of ProFOLD.
Correlation between quality of the predicted structures and (a) Meff, (b) the average probability of the top L predicted contacts (PPC). For the CASP13 FM target proteins, the correlation coefficient between Meff and TMscore of the predicted structures by ProFOLD is as high as 0.69. The correlation coefficient between PPC and TMscore of the predicted structures is 0.82.
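The two quantities in this caption can be sketched as follows. The caption does not define them in detail, so the definitions here are assumptions: Meff is computed with a common convention in which each sequence is down-weighted by the number of sequences at or above an identity threshold (0.8 in this sketch), and PPC is taken as the mean of the L largest predicted contact probabilities over all residue pairs. The function names `meff` and `ppc` are hypothetical.

```python
import numpy as np


def meff(msa, identity_threshold=0.8):
    """Number of effective sequences: sum over sequences of 1 / (number of
    sequences whose identity to it is >= identity_threshold, itself included)."""
    arr = np.array([list(seq) for seq in msa])
    identity = (arr[:, None, :] == arr[None, :, :]).mean(axis=-1)
    neighbours = (identity >= identity_threshold).sum(axis=1)
    return float((1.0 / neighbours).sum())


def ppc(contact_probs):
    """Average probability of the top L predicted contacts (pairs i < j)."""
    length = contact_probs.shape[0]
    i, j = np.triu_indices(length, k=1)
    top = np.sort(contact_probs[i, j])[::-1][:length]
    return float(top.mean())


# Toy inputs: three near-identical sequences and one distinct sequence.
toy_msa = ["AAAAA", "AAAAA", "AAAAG", "GGGGG"]
toy_probs = np.array([[0.0, 0.9, 0.5],
                      [0.9, 0.0, 0.1],
                      [0.5, 0.1, 0.0]])

print(meff(toy_msa))
print(ppc(toy_probs))
```

Given per-target arrays of such values and the corresponding TMscores, a Pearson correlation coefficient like the ones quoted above could be obtained with `np.corrcoef(x, y)[0, 1]`.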
Fig. 7. Comparison of the predicted inter-residue distances (bottom left) with the ground-truth distances (upper right) for protein T1022s1-D1.
a ProFOLD w/o E+R performed poorly and failed to generate high-quality distance estimations. b When equipped with the MSA encoder module, the variant ProFOLD w/o R could generate relatively accurate distance estimations. c When both MSA encoder and 2D ResNet are used, ProFOLD gave distance estimations extremely close to the real distance values.
