Biomolecules. 2022 Aug 17;12(8):1131.
doi: 10.3390/biom12081131.

An Interpretable Machine-Learning Algorithm to Predict Disordered Protein Phase Separation Based on Biophysical Interactions

Hao Cai et al. Biomolecules. 2022.

Abstract

Protein phase separation is increasingly understood to be an important mechanism of biological organization and biomaterial formation. Intrinsically disordered protein regions (IDRs) are often significant drivers of protein phase separation. A number of protein phase-separation-prediction algorithms are available, with many being specific for particular classes of proteins and others providing results that are not amenable to the interpretation of the contributing biophysical interactions. Here, we describe LLPhyScore, a new predictor of IDR-driven phase separation, based on a broad set of physical interactions or features. LLPhyScore uses sequence-based statistics from the RCSB PDB database of folded structures for these interactions, and is trained on a manually curated set of phase-separation-driving proteins with different negative training sets including the PDB and human proteome. Competitive training for a variety of physical chemical interactions shows the greatest contribution of solvent contacts, disorder, hydrogen bonds, pi-pi contacts, and kinked beta-structures to the score, with electrostatics, cation-pi contacts, and the absence of a helical secondary structure also contributing. LLPhyScore has strong phase-separation-prediction recall statistics and enables a breakdown of the contribution from each physical feature to a sequence's phase-separation propensity, while recognizing the interdependence of many of these features. The tool should be a valuable resource for guiding experiments and providing hypotheses for protein function in normal and pathological states, as well as for understanding how specificity emerges in defining individual biomolecular condensates.

Keywords: biomolecular condensates; intrinsically disordered proteins; machine learning; phase separation; physical interactions; predictor.

Conflict of interest statement

J.D.F.-K. is an advisor for Faze Medicines. The authors declare that this affiliation has not influenced the work reported here in any way.

Figures

Figure 1
Data curation workflow. A schematic diagram of how data for training were obtained and processed.
Figure 2
Physical-interaction- and structure-based feature extraction. An example of the feature representation of a sequence is given, with the sequence “GDVT” converted to the pi–pi (long-range) feature matrix.
Figure 3
Predictor training workflow. A schematic diagram of the steps in training is shown.
Figure 4
Direction of correlation of features with performance of the developing phase-separation predictor. Training curves of 16 features reveal the direction of correlation of each feature with the score. Features whose curves rise towards AUROC = 1.0 have “positive” signs; features whose curves decline towards AUROC = 0.0 have “negative” signs.
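To make the sign convention concrete, here is a minimal sketch of how an AUROC value maps to a "positive" or "negative" feature direction. It is an illustration only, not the authors' training code: the function names `auroc` and `feature_sign` are hypothetical, and the AUROC is computed with the standard pairwise (Mann-Whitney) definition.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the usual Mann-Whitney convention).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def feature_sign(pos_scores, neg_scores):
    """Hypothetical helper: +1 if AUROC > 0.5, -1 if AUROC < 0.5, 0 if exactly 0.5."""
    a = auroc(pos_scores, neg_scores)
    if a > 0.5:
        return 1
    if a < 0.5:
        return -1
    return 0
```

An AUROC near 0.0 is as informative as one near 1.0: it means the feature ranks PS-positive sequences consistently below PS-negative ones, so flipping its sign yields a useful score.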
Figure 5
Ranking of the importance of features to discrimination in the developing phase-separation predictor between PS-positive and PS-negative sequences. The z-score of each PS-positive sequence’s individual feature values against the mean of the PS-negative sequences’ values is shown.
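As a rough sketch of the z-score described in this caption, the snippet below standardizes PS-positive feature values against the PS-negative distribution. Scaling by the standard deviation of the negatives is an assumption (the caption only names the negative mean), and `z_scores_against_negatives` is a hypothetical name.

```python
from statistics import mean, pstdev


def z_scores_against_negatives(pos_values, neg_values):
    """Z-score each PS-positive value relative to the PS-negative distribution.

    Assumption: the scale is the population standard deviation of the negatives.
    """
    mu = mean(neg_values)
    sigma = pstdev(neg_values)
    if sigma == 0:
        raise ValueError("negative values have zero variance; z-score is undefined")
    return [(v - mu) / sigma for v in pos_values]
```

A feature whose PS-positive z-scores sit far from zero (in either direction) separates the two classes more strongly than one whose z-scores cluster around zero.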
Figure 6
Performance of the final predictor model. Performance plots of the final human + PDB model on evaluation set 1 (left, PS-positive sequences and the entire PDB proteome) and evaluation set 2 (right, PS-positive sequences and the entire human proteome). (a,d) ROC curves. (b,e) Predicted score boxplots of positive vs. negative sequences. (c,f) Distribution histograms of positive vs. negative sequences.
Figure 7
Comparison of three training baselines and the final human + PDB predictor model for validation. Baseline 1 was created by providing random values from a normal distribution N(0, 1) in the weight-training step instead of providing PDB-based physical-feature values to the genetic algorithm. Baseline 2 was created by providing random values from the distribution of residue-specific physical-feature values instead of providing sequence-based physical-feature values. Baseline 3 was created by optimizing 1 weight for 20 residue types for each physical feature (removing residue specificity) during training instead of optimizing 20 weights for 20 residue types for each physical feature.
Figure 8
Comparison of the performance of predictors trained on eight features vs. one feature for the human + PDB model. (a) ROC curves of one-feature predictors vs. the eight-feature predictor. (b) Venn diagrams showing the coverage overlaps of PS-positive sequences by one-feature predictors vs. the eight-feature predictor at a confidence threshold that returns 2% of the PDB.
Figure 9
Comparison of LLPhyScore (three models) with other phase-separation predictors. Relationship between percent recall and total percentage of (a) evaluation set 1 and (b) evaluation set 2 accepted at the given thresholds for PScore, catGRANULE, PLAAC, PSPredictor, FuzDrop and LLPhyScore.
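The recall-versus-accepted relationship plotted here can be sketched as follows. This is an illustrative calculation under stated assumptions, not the authors' evaluation script: a sequence is "accepted" when its score is at or above the threshold, `all_scores` is assumed to hold scores for the whole evaluation set (PS-positive sequences plus the proteome), and both function names are hypothetical.

```python
def recall_and_accepted(pos_scores, all_scores, threshold):
    """Return (percent recall of positives, percent of the whole set accepted).

    Assumption: a sequence is accepted when score >= threshold.
    """
    recalled = sum(1 for s in pos_scores if s >= threshold)
    accepted = sum(1 for s in all_scores if s >= threshold)
    return (100.0 * recalled / len(pos_scores),
            100.0 * accepted / len(all_scores))


def recall_curve(pos_scores, all_scores):
    """Evaluate every distinct score as a threshold, from strictest to loosest."""
    thresholds = sorted(set(all_scores), reverse=True)
    return [recall_and_accepted(pos_scores, all_scores, t) for t in thresholds]
```

A predictor is doing better when its curve reaches high recall while accepting only a small percentage of the evaluation set.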
Figure 10
Feature-score-based clustering for PS-positive proteins for the human + PDB model. (a) Plot of two abstracted dimensions for clustering based on feature z-scores, showing the separation of different types of phase-separating sequences. (b) The score breakdown of four example sequences from four distinct clusters in (a): FUS (human), Nup98 (human), elastin-like peptide (ELP, VPGVG_30, 30 repeats of VPGVG) and MEG-3 (C. elegans).
Figure 11
Enrichment heatmap by GO functional annotations for different features for the human + PDB model. Heatmap showing the enrichment of the proteins with a given functional annotation that fall under a 10% confidence threshold for each single-feature score and the eight-feature sum score. The color gradient shows the natural logarithm of the enrichment percentage. The black boxes indicate that no proteins in this GO term are within the top 10% of the corresponding score type.
Figure 12
LLPhyScore score enrichment by eight selected physical features for the PDB proteome, per residue type, for the human + PDB model. Heatmaps show the score enrichment in PDB protein sequences by each feature’s discrete values, normalized to each residue type. The color gradient shows the natural logarithm of the observed over expected ratio. Enrichment for (a) secondary structure (H, alpha-helix; E, beta-sheet; G, 3₁₀ helix; T, hydrogen-bonded turn; L, loop; S, bend; B, single-pair beta-sheet), (b) short-range pi–pi, (c) K-beta, (d) disorder, (e) short-range electrostatic, (f) long-range electrostatic, (g) protein–water and (h) protein–carbon. The color bar for all heatmaps is shown at the right.
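The color scale in these heatmaps, ln(observed / expected), can be illustrated with a short sketch for a single residue type. How the observed and expected counts are defined is not stated in the caption, so the inputs here are assumptions: `observed_counts` and `expected_counts` are hypothetical dictionaries mapping a feature bin (for example a DSSP letter) to a count, and each is normalized to a fraction before the ratio is taken.

```python
import math


def log_enrichment(observed_counts, expected_counts):
    """ln(observed fraction / expected fraction) per feature bin for one residue type.

    Bins with a zero observed or expected count are returned as None,
    because the log ratio is undefined there.
    """
    obs_total = sum(observed_counts.values())
    exp_total = sum(expected_counts.values())
    result = {}
    for bin_label, exp_n in expected_counts.items():
        obs_n = observed_counts.get(bin_label, 0)
        if obs_n == 0 or exp_n == 0:
            result[bin_label] = None
            continue
        result[bin_label] = math.log((obs_n / obs_total) / (exp_n / exp_total))
    return result
```

Positive values mark bins that are over-represented relative to expectation, negative values mark depleted bins, and zero means no enrichment.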
Figure 13
Disordered character of PDB sequences according to the LLPhyScore of chain reference sequences. Panel (a) shows the fraction of proteins in each percentile bin of LLPhyScore for which more than 50% of the reference sequence is missing from density (protein sequence that does not show up in the structure). Panel (b) shows the disordered/irregular structural character of residues that are within the density in the structure, with blue showing the fraction of proteins in each percentile bin for which more than 50% of the observed residues have a DSSP assignment other than helix or strand, and orange showing the fraction for which more than 50% of such residues are found in stretches of at least four residues in length with no helical or sheet structure.
