Comparative Study
Nat Commun. 2021 May 26;12(1):3168.
doi: 10.1038/s41467-021-23303-9.

Structure-based protein function prediction using graph convolutional networks

Free PMC article

Vladimir Gligorijević et al. Nat Commun. 2021.

Abstract

The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Schematic method overview.
a LSTM language model, pre-trained on ~10 million Pfam protein sequences, used for extracting residue-level features of a PDB sequence. b Our GCN with three graph convolutional layers for learning complex structure–function relationships.
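The pipeline in Fig. 1 can be illustrated with a minimal sketch of a single graph convolution over a residue contact map. Several details here are assumptions rather than the authors' implementation: the 10 Å C-alpha distance cutoff, the symmetric normalization of the adjacency matrix (the common Kipf and Welling form), and the toy coordinates, features, and weights. In DeepFRI the input features come from the LSTM language model and three such layers are stacked; the helper names below are hypothetical.

```python
import math

def contact_map(coords, threshold=10.0):
    """Binary contact map from C-alpha coordinates.
    The 10 A cutoff is an assumption; the diagonal is 1 (self-contacts)."""
    n = len(coords)
    return [[1.0 if math.dist(coords[i], coords[j]) <= threshold else 0.0
             for j in range(n)] for i in range(n)]

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 A D^-1/2 (A already contains self-loops)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj_norm, features, weights):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]

# Toy chain: three residues in mutual contact, one isolated residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # stand-in residue features
weights = [[1.0, -1.0], [0.5, 0.5]]                          # stand-in learned weights

adj = contact_map(coords)
hidden = gcn_layer(normalize_adjacency(adj), features, weights)
print(hidden)
```

Each residue's new representation is an average over itself and its spatial neighbours, so residues that are distant in sequence but close in structure share information. The isolated residue only sees its own features.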
Fig. 2. Performance of DeepFRI in predicting MF-GO terms of experimental structures and protein models.
a Precision-recall curves showing the performance of DeepFRI on ~700 protein contact maps (PDB700 dataset) from NATIVE PDB structures (CMAP_NATIVE, black), their corresponding Rosetta-predicted lowest energy (LE) models (CMAP-Rosetta_LE, orange) and DMPfold lowest energy (LE) models (CMAP-DMPFold_LE, red), in comparison to the sequence-only CNN-based method (SEQUENCE, blue). All DeepFRI models are trained only on experimental PDB structures. b Distribution of protein-centric Fmax score over 1500 different Rosetta models from the PDB700 dataset grouped by their TM-score computed against the native structures. Data are represented as boxplots with the center line representing the median, upper and lower edges of the boxes representing the interquartile range, and whiskers representing the data range (0.5 × interquartile range). c An example of DeepFRI predictions for Rosetta models of a lipid-binding protein (PDB id: 1IFC) with different TM-scores computed against its native structure. A DeepFRI output score >0.5 is considered a significant prediction. d, e Precision-recall curves showing: d the performance of our method, trained only on PDB experimental structures and evaluated on homology models from SWISS-MODEL (red), in comparison to the CNN-based method (DeepGO) trained only on PDB sequences (blue) and the BLAST baseline (gray); e the performance of DeepFRI trained on PDB (blue), SWISS-MODEL (orange) and both PDB and SWISS-MODEL (red) structures in comparison to the BLAST baseline (gray). The dot on each curve indicates where the maximum F-score is achieved (a perfect prediction would have Fmax = 1 at the top right corner of the plot).
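The protein-centric Fmax reported in these panels can be sketched as follows, assuming the CAFA-style definition: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all annotated proteins, and Fmax is the largest resulting F-score. The function name, the threshold grid, and the term labels are hypothetical, and the paper's exact evaluation code may differ.

```python
def protein_centric_fmax(y_true, y_score, thresholds=None):
    """CAFA-style protein-centric Fmax (assumed definition).

    y_true  : list of sets of true terms, one set per protein
    y_score : list of dicts mapping term -> predicted score, one per protein
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            if not truth:
                continue  # only proteins with known annotations are evaluated
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:  # precision only over proteins with a prediction
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(truth))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

# Toy example with placeholder term labels.
y_true = [{"GO:A", "GO:B"}, {"GO:C"}]
y_score = [
    {"GO:A": 0.9, "GO:B": 0.4, "GO:C": 0.2},
    {"GO:C": 0.8, "GO:A": 0.6},
]
fmax = protein_centric_fmax(y_true, y_score)
print(round(fmax, 4))
```

The threshold that achieves this maximum corresponds to the dot drawn on each precision-recall curve.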
Fig. 3. Performance over GO terms in different ontologies and EC numbers.
Precision-recall curves showing the performance of different methods on (a) MF-GO terms and (c) EC numbers on the test set comprised of PDB chains chosen to have ≤30% sequence identity to the chains in the training set. Coverage of the methods is shown in the legend. Distribution of the Fmax score under 100 bootstrap iterations for the top three best-performing methods applied on (b) MF-GO terms and (d) EC numbers computed on the test PDB chains and grouped by maximum % sequence identity to the training set. e Distribution of protein-centric Fmax score and function-centric AUPR score under 10 bootstrap iterations summarized over all test proteins and GO terms/EC numbers, respectively. f Distribution of AUPR score on MF-GO terms of different levels of specificity under 10 bootstrap iterations. Every figure illustrates the performance of DeepFRI (red) in comparison to sequence-based annotation transfer from protein families, FunFams (blue), the CNN-based method DeepGO (orange), the SVM-based method FFPred (black), and the BLAST baseline (gray). Error bars on the bar plots (e and f) represent standard deviation of the mean. In panels b and d, data are represented as boxplots with the center line representing the median, upper and lower edges of the boxes representing the interquartile range, and whiskers representing the data range (0.5 × interquartile range).
Fig. 4. Automatic mapping of function prediction to sites on protein structures.
a An example of the gradient-weighted class activation map for ‘Ca Ion Binding’ (right) mapped onto the 3D structure of rat α-parvalbumin (PDB Id: 1S3P), chain A (left), annotated with calcium ion binding. The two highest peaks in the grad-CAM activation profile correspond to calcium-binding residues. b ROC curves showing the overlap between grad-CAM activation profiles and binding sites, retrieved from the BioLiP database, computed for the PDB chains shown in panel (c). c Examples of other PDB chains annotated with DNA binding, GTP binding, and glutathione transferase activity. All residues are colored using a gradient color scheme matching the grad-CAM activity profile, with more salient residues highlighted in red and less salient residues highlighted in blue. No information about co-factors, active sites, or site-specificity was used during training of the model.
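The overlap between a per-residue saliency profile and known binding sites, summarized here by ROC curves, can be quantified with the area under the ROC curve. The sketch below uses the rank-based (Mann-Whitney) form of the AUC; the saliency values and binding-site labels are invented for illustration and are not taken from BioLiP or from any grad-CAM output, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparisons.

    labels : 1 for a binding-site residue, 0 otherwise
    scores : per-residue saliency (e.g. a grad-CAM profile)
    Equals the probability that a random binding residue outranks a random
    non-binding residue, with ties counted as half.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative residue")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented six-residue example: residues 2 and 5 are binding sites.
saliency = [0.10, 0.90, 0.20, 0.80, 0.30, 0.05]
binding = [0, 1, 0, 0, 1, 0]
auc = roc_auc(binding, saliency)
print(auc)
```

An AUC near 1 means the most salient residues coincide with the annotated sites, while 0.5 corresponds to no better than random overlap.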
Fig. 5. Identifying catalytic residues in enzymes using grad-CAM applied on the DeepFRI model trained on EC numbers.
All residues are colored using a gradient color scheme matching the grad-CAM activity score, with more salient residues highlighted in red and less salient residues highlighted in blue. The PDB chains (shown in panels a–i) are annotated with all of their known catalytic residues (available in Catalytic Site Atlas), each with a residue number and a pointer to its location on the structure. Residues correctly identified by our method are highlighted in red.
Fig. 6. Predicting and mapping function to unannotated PDB & SWISS-MODEL chains.
Percentage/number of PDB chains (a) and SWISS-MODEL chains (b) with MF-, BP-, and CC-GO terms predicted by our method; the number of specific GO term predictions (with IC >5) is shown in blue and red for PDB and SWISS-MODEL chains, respectively. c An example of a Fe–S-cluster-containing hydrogenase (PDB Id: 6F0K), found in Rhodothermus marinus, with missing GO term annotations in SIFTS (unannotated). The PDB chain lacks annotations in the databases used for training our model, and DeepFRI predicts that it binds a 4Fe–4S iron–sulfur cluster with a high confidence score. The predicted grad-CAM profile significantly overlaps with ligand-binding sites of 4Fe–4S obtained from BioLiP, as shown by the ROC curve. d grad-CAM profiles for predicted DNA binding and metal ion binding functions mapped onto the structure of an unannotated zinc finger protein (PDB Id: 1MEY) found in Escherichia coli; the corresponding ROC curves show significant overlap between the grad-CAM profile and the binding sites obtained from BioLiP.
