Cell Syst. 2021 Oct 20;12(10):969-982.e6.
doi: 10.1016/j.cels.2021.08.010. Epub 2021 Oct 9.

D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions

Samuel Sledzieski et al. Cell Syst. 2021.

Abstract

We combine advances in neural language modeling and structurally motivated design to develop D-SCRIPT, an interpretable and generalizable deep-learning model that predicts interaction between two proteins using only their sequences and maintains high accuracy with limited training data and across species. We show that a D-SCRIPT model trained on 38,345 human PPIs enables significantly improved functional characterization of fly proteins compared with the state-of-the-art approach. Evaluating the same D-SCRIPT model on protein complexes with known 3D structure, we find that the inter-protein contact map output by D-SCRIPT has significant overlap with the ground truth. We apply D-SCRIPT to screen for PPIs in cow (Bos taurus) at a genome-wide scale and, focusing on rumen physiology, identify functional gene modules related to metabolism and immune response. The predicted interactions can then be leveraged for function prediction at scale, addressing the genome-to-phenome challenge, especially in species where little data are available.

Keywords: cow rumen; deep learning; embedding; function prediction; genome to phenome; interpretability; language models; metabolism; module detection; protein-protein interaction.

Conflict of interest statement

Declaration of interests The authors declare no competing interests.

Figures

Figure 1: D-SCRIPT Motivation and Workflow.
We demonstrate how D-SCRIPT can be used genome-wide to predict a complete PPI network in the fly. (a) Experimentally derived PPI data are scarce in species other than human and yeast, even when normalized for genome size (sourced from BioGRID; see STAR Methods). (b) A D-SCRIPT model, after being trained on a large corpus of human PPI data, can be applied broadly to a species of interest even if little PPI data are available in that species. For each pair of proteins in the target species, D-SCRIPT converts the pair of protein sequences into a score representing the probability of interaction. Because D-SCRIPT scales to large numbers of protein pairs and maintains performance across species, it can be used to score all protein pairs genome-wide to predict a synthetic PPI network in the species, facilitating a genome-to-phenome translation. (c) Detail of the D-SCRIPT architecture from the box in (b) (see Figure 2 for more detail). D-SCRIPT generalizes due to its structurally motivated design. The pre-trained language model generates structural features for a single protein, while the projection and convolution model the interaction between every pair of residues in the candidate pair. In the final layer, we introduce a magnitude regularization term to ensure that the predicted inter-protein contact map is structurally plausible.
Figure 2: D-SCRIPT Architecture.
Left to right: The Pre-trained Embedding Model, a deep learning language model from Bepler and Berger, generates features for each individual protein. The Projection Module reduces them to d dimensions. Each low-dimensional single-protein embedding implicitly includes, among other features, an encoding that broadly captures the protein’s residue-contact map (Figure 5). The Contact Module combines these low-dimensional embeddings to compute a sparse inter-protein contact map through a two-step process which first computes a representation for each pair of residues, then incorporates local information using a convolutional filter. Finally, the Interaction Prediction Module uses a customized max-pooling operation to predict the probability of interaction between the input proteins.
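The data flow in this caption (project, score every residue pair, apply a local filter, pool to one probability) can be sketched in a few lines of Python. This is only a toy illustration of the shapes involved, not the authors' implementation: the random features stand in for the pre-trained language model output, the sigmoid of a dot product stands in for the learned pairwise representation, the fixed averaging window stands in for the learned convolutional filter, and a plain global maximum stands in for the customized max-pooling. All function names and dimensions here are hypothetical.

```python
import math
import random


def project(embedding, weights):
    """Linearly map per-residue features (n x D) down to d dimensions.

    `weights` holds d rows of length D (hypothetical, untrained values).
    """
    return [[sum(x * w for x, w in zip(residue, row)) for row in weights]
            for residue in embedding]


def pairwise_scores(z1, z2):
    """Score every residue pair (i, j) between the two proteins.

    Assumption: a sigmoid of the dot product replaces the learned
    pairwise representation described in the caption.
    """
    return [[1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(r1, r2))))
             for r2 in z2] for r1 in z1]


def local_filter(matrix, radius=1):
    """Average each cell with its neighbours.

    Assumption: a fixed mean filter replaces the learned convolution.
    """
    n, m = len(matrix), len(matrix[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cells = [matrix[a][b]
                     for a in range(max(0, i - radius), min(n, i + radius + 1))
                     for b in range(max(0, j - radius), min(m, j + radius + 1))]
            out[i][j] = sum(cells) / len(cells)
    return out


def interaction_probability(contact_map):
    """Plain global max-pooling (the paper uses a customized variant)."""
    return max(max(row) for row in contact_map)


def predict(emb1, emb2, weights):
    """Return the toy inter-protein contact map and interaction score."""
    z1, z2 = project(emb1, weights), project(emb2, weights)
    contact_map = local_filter(pairwise_scores(z1, z2))
    return contact_map, interaction_probability(contact_map)


random.seed(0)
D, d = 8, 3  # hypothetical feature and projection sizes
weights = [[random.gauss(0, 1) for _ in range(D)] for _ in range(d)]
protein_a = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
protein_b = [[random.gauss(0, 1) for _ in range(D)] for _ in range(7)]
contact_map, prob = predict(protein_a, protein_b, weights)
print(len(contact_map), len(contact_map[0]))  # one row per residue of A, one column per residue of B
```

The point of the sketch is the shape of the intermediate result: a protein of length n and one of length m yield an n x m contact map, and the interaction score is read off that map rather than predicted separately.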
Figure 3: Improved Protein Functional Characterization using D-SCRIPT Modules.
D-SCRIPT recovers more functionally coherent clusters than PIPR (p = 0.000723, one-tailed t-test). Evaluating 10,475,595 candidate protein pairs yielded 384 protein clusters with D-SCRIPT and 374 with PIPR. We computed the diffusion state distance (DSD) between all proteins, clustered the DSD matrix using spectral clustering, filtered out small clusters (size < 3), and recursively split large clusters (size > 100). Within-cluster similarity was calculated as the average Jaccard similarity between the GO Slim annotations of all pairs of proteins in the cluster. See also Figure S2.
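The within-cluster similarity described above is simple enough to write out. The following is a minimal sketch, assuming each protein's annotations are available as a set of GO term identifiers; the protein names and term sets are made up for illustration, and scoring two unannotated proteins as 0 is an assumption the caption does not address.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; empty-vs-empty is scored 0 (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def within_cluster_similarity(annotations):
    """Average Jaccard similarity over all protein pairs in one cluster.

    `annotations` maps a protein identifier to its set of GO Slim terms.
    """
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-protein cluster.
cluster = {
    "P1": {"GO:0003677", "GO:0005634"},
    "P2": {"GO:0003677", "GO:0005634", "GO:0006351"},
    "P3": {"GO:0005739"},
}
score = within_cluster_similarity(cluster)
print(round(score, 4))  # 0.2222: pair similarities are 2/3, 0 and 0
```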
Figure 4: Protein Interaction Network in Bovine Rumen.
We applied D-SCRIPT to predict a de novo PPI network in cow (B. taurus) and investigated specifically the functional modules likely to be active in the cow rumen (a,b,c,d,e). After evaluating 50 million candidate protein pairs, we generated a network of 476,399 predicted PPIs between 17,811 proteins and performed spectral clustering on the diffusion state distance (DSD) matrix of the network to identify functional modules, shortlisting five modules related to rumen physiology. A recent RNA-seq study validates several proteins in these modules as being strongly overexpressed in rumen tissue. For each module, we report the gene ontology molecular function (GO:MF), biological process (GO:BP) and cellular component (GO:CC) annotations that are significantly enriched for the proteins in the cluster, computed using g:Profiler. We also show the log(fold change) for genes in the cluster that are more highly expressed in rumen tissue than on average across all tissues. For each module, nodes have been added in gray where necessary to fully connect all nodes. We find three modules that contain members of the PRD-SPRRII family and are enriched for phosphate and mitochondrial metabolism (a,b) and regulation of cell growth mechanisms (c). We also find a module with TCHH-like 2 proteins enriched for immune response (d), and a module with S100-A2 and S100-A12 proteins enriched for transcriptional regulation and chromatin organization (e). The modules in a, b and e are directly connected through TARS and MRPL4, which suggests a link between these functions in bovine rumen. (f) We demonstrate that protein pairs with a predicted D-SCRIPT edge show significantly higher coexpression between their respective genes (one-sided Welch’s t-test). This coexpression signal becomes even stronger when evaluated only on protein pairs within a functional module, suggesting that both the protein network and the functional modules are biologically meaningful. See also Figure S4 and Figure S5.
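For readers unfamiliar with the test named in panel (f), the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The coexpression values are invented for illustration and are not data from the paper. Converting the statistic to a one-sided p-value needs the t distribution, which the standard library lacks; SciPy's `scipy.stats.ttest_ind` with `equal_var=False` and `alternative="greater"` is one way to obtain it.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's unequal-variance t statistic and its degrees of freedom.

    A positive t means sample_a has the larger mean, which is the
    direction tested by a one-sided "greater" alternative.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical gene coexpression values for protein pairs
# with and without a predicted edge.
with_edge = [0.62, 0.55, 0.71, 0.48, 0.66]
without_edge = [0.21, 0.35, 0.10, 0.28, 0.19, 0.25]
t_stat, dof = welch_t(with_edge, without_edge)
print(t_stat > 0)  # True: pairs with an edge have the higher mean here
```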
Figure 5: D-SCRIPT Embeddings Represent Structure and Interaction.
After a full model has been trained to predict interaction, the low dimensional embeddings learned by the projection module of D-SCRIPT can be used as meaningful representations of the protein in other applications. (a,b,c,d,e) The PDB identifier 1GNG corresponds to a protein with 356 residues where the accuracy of using the D-SCRIPT embedding to predict self-contacts is near the median of cases we studied (AUPR=0.19), while 1CGI corresponds to a short protein (54 residues) in which the embedding achieves a higher accuracy (AUPR=0.38). On a set of 300 PDB structures, we assessed contacts at 8 Å (a, c) and, using a training set of 100 structures, trained a logistic regression to predict contacts (b, d) for the remaining structures. The binarization thresholds for panel (e) were chosen so as to result in the same number of contacts as in the original maps. (f,g,h) D-SCRIPT embeddings also enable the accurate recovery of true interacting protein pairs in the neighborhood of known PPIs in human (f), yeast (g), and roundworm (h). D-SCRIPT embeddings recover more interacting proteins than any other embedding, regardless of species or number of neighbors checked. AAClass also performs well, likely because it characterizes biochemistry which is preserved at longer evolutionary distances. BLAST performs well at low values of k but has difficulty recovering interactions for larger values — likely due to network rewiring over longer evolutionary distances.
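Two steps in this caption lend themselves to a short illustration: deriving a ground-truth contact map at an 8 Å cutoff, and binarizing a predicted map so that it has the same number of contacts as the true one. The sketch below assumes one representative 3D coordinate per residue, treats a distance of at most 8 Å as a contact, and ignores the diagonal; the caption specifies none of these details, and the coordinates and scores are invented.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary residue-residue contact map from 3D coordinates (in Å).

    Assumptions: one coordinate per residue, contact if distance <= cutoff,
    and a residue is never in contact with itself.
    """
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)] for i in range(n)]


def binarize_to_match(scores, n_contacts):
    """Threshold `scores` so the n_contacts highest entries become contacts.

    Tied scores at the threshold can yield slightly more than n_contacts.
    """
    if n_contacts <= 0:
        return [[0 for _ in row] for row in scores]
    flat = sorted((s for row in scores for s in row), reverse=True)
    threshold = flat[min(n_contacts, len(flat)) - 1]
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


# Hypothetical four-residue chain with residues spaced 3.8 Å apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
truth = contact_map(coords)
n_true = sum(map(sum, truth))

# Hypothetical predicted contact scores for the same four residues.
predicted = [
    [0.01, 0.90, 0.60, 0.05],
    [0.88, 0.02, 0.80, 0.50],
    [0.61, 0.79, 0.03, 0.70],
    [0.04, 0.52, 0.71, 0.06],
]
binary = binarize_to_match(predicted, n_true)
print(n_true, sum(map(sum, binary)))  # 10 10
```

Matching the contact count makes the two binary maps directly comparable, because neither map is favored by a looser or stricter threshold.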
Figure 6: D-SCRIPT Predicts Biologically Meaningful Contact Maps.
We show inter-protein contact maps of protein structures known to dock together (Hwang et al., 2010). Panels (a,b) correspond to pairs where D-SCRIPT correctly predicted an interaction, while panels (c,d) are cases where it incorrectly predicted no interaction. The black-and-white matrices correspond to the PDB ground truth, while the colored matrices correspond to D-SCRIPT’s predicted contact map Ĉ; for the latter, the color scales of (a,b) differ from those of (c,d). While Ĉ contains some large values for positive pairs, its maximum Cmax is very low for negative pairs. Panel (e) shows a violin plot of a systematic evaluation (295 protein pairs, each with 500 bootstrap samples to generate the p-value) of the 2-D Earth mover’s distance-based similarity between Ĉ and the ground truth. Not only are the predicted maps Ĉ of correctly predicted pairs substantially similar to the ground truth (median FDR-corrected q = 0.08, one-sided t-test), but even when D-SCRIPT incorrectly predicts that two proteins do not interact, its contact maps are still similar to the ground truth. PDB Identifiers: a) 2J7P (A/D), b) 1NW9 (B/A), c) 3H2V (A/E), d) 1F51 (A/E).
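The caption reports FDR-corrected q-values without naming the correction procedure. As an illustration only, the sketch below implements the Benjamini-Hochberg procedure, which is a common choice; whether the authors used it is an assumption. The p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values, returned in the input order.

    Each p-value is scaled by m / rank, then a running minimum from the
    largest p-value downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical per-pair p-values.
p = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(p)
print([round(v, 4) for v in q])  # [0.04, 0.0533, 0.0533, 0.2]
```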

References

Adams MD et al. (2000) ‘The genome sequence of Drosophila melanogaster’, Science, 287(5461), pp. 2185–2195.
Alborzi SZ, Ritchie DW and Devignes M-D (2018) ‘Computational discovery of direct associations between GO terms and protein domains’, BMC Bioinformatics, 19(14), p. 413. doi: 10.1186/s12859-018-2380-2.
Alonso A et al. (2004) ‘Protein tyrosine phosphatases in the human genome’, Cell, 117(6), pp. 699–711.
Alonso A and Pulido R (2016) ‘The extended human PTPome: A growing tyrosine phosphatase family’, The FEBS Journal, 283(8), pp. 1404–1429.
Altschul SF et al. (1990) ‘Basic local alignment search tool’, Journal of Molecular Biology, 215(3), pp. 403–410.
