J Phys Chem B. 2021 May 20;125(19):5022-5034.
doi: 10.1021/acs.jpcb.1c02081. Epub 2021 May 11.

UMAP as a Dimensionality Reduction Tool for Molecular Dynamics Simulations of Biomacromolecules: A Comparison Study

Francesco Trozzi et al. J Phys Chem B. 2021.

Abstract

Proteins are the molecular machines of life. The multitude of possible conformations that proteins can adopt determines their free-energy landscapes. However, the inherently high dimensionality of a protein free-energy landscape poses a challenge to deciphering how proteins perform their functions. For this reason, dimensionality reduction is an active field of research for molecular biologists. The uniform manifold approximation and projection (UMAP) is a dimensionality reduction method based on a fuzzy topological analysis of data. In the present study, the performance of UMAP is compared with that of other popular dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and time-structure-based independent component analysis (tICA) in the context of analyzing molecular dynamics simulations of the circadian clock protein VIVID. A good dimensionality reduction method should accurately represent the data structure on the projected components. The comparison of the raw high-dimensional data with the projections obtained using different dimensionality reduction methods based on various metrics showed that UMAP has superior performance when compared with linear reduction methods (PCA and tICA) and has competitive performance and scalable computational cost.
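One of the metrics used for this kind of comparison is the Pearson correlation between pairwise distances in the original space and in the projection. The sketch below illustrates that idea only; it is not the authors' pipeline. The toy data, the function names, and the stand-in projection (keeping the first two coordinates instead of running UMAP, t-SNE, PCA, or tICA) are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def distance_preservation(high_dim, low_dim):
    """Correlate pairwise distances before and after projection."""
    return pearson(pairwise_distances(high_dim), pairwise_distances(low_dim))


# Toy data (hypothetical): 50 "frames" with 6 coordinates each.
rng = random.Random(0)
frames = [tuple(rng.gauss(0.0, 1.0) for _ in range(6)) for _ in range(50)]

# Stand-in projection (assumption): keep the first two coordinates.
# In practice this would be the 2D output of the reduction method under test.
projection = [frame[:2] for frame in frames]

score = distance_preservation(frames, projection)
identity_score = distance_preservation(frames, frames)
```

A value near 1 means the projection preserves the relative distances between frames; comparing a space with itself (`identity_score`) gives 1 up to rounding. For real trajectories the number of frame pairs grows quadratically, so a subsample of frames would likely be needed.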

Figures

Figure 1.
Pearson correlation analysis of UMAP, t-SNE, PCA, and tICA calculated based on the 2D reduced representations. A) Pearson correlation values between projected and high-dimensional trajectories. B) Scatterplots where the X-axis represents the distances in the high-dimensional space, while the Y-axis represents the distances in the low-dimensional space. The coloring represents the agreement between the original and projected distances. Red and blue represent agreement and disagreement, respectively.
Figure 2.
Averaged RMSD of 1000 microstates for various 2D representations and Cartesian coordinates. Microstates were sorted based on the average RMSD values.
Figure 3.
Selection of the number of macrostates based on cluster RMSD. A) Heatmap of RMSD within each state. B) Violin plot of RMSD values within states (blue) and between states (orange).
Figure 4.
Similarity, expressed in percentage, between cluster populations in low-dimensional representations and high-dimensional Cartesian space.
Figure 5.
Comparison of silhouette coefficient for UMAP, t-SNE, PCA, and tICA projections vs Cartesian space results. Bar heights represent the deviation in coefficient from the Cartesian case. Positive values represent higher separation of the clusters in the projected space. Negative values represent overcrowding of the clusters in projected spaces.
Figure 6.
Machine learning prediction accuracy of the different macrostates based on the 2D input of the low-dimensional representation using Random Forest.
Figure 7.
Comparison of implied timescales of different methods.
Figure 8.
Heatmap representation of the divergence of the transition matrices obtained using different dimensionality reduction methods from the high-dimensional transition matrix.
Figure 9.
Performance of different methods as a function of the number of components used in the projection. A) Pearson correlation between the high-dimensional representation and the reduced representation of the data at varying numbers of projected dimensions. B) Transition matrix error between the high-dimensional representation and the reduced representation of the data at varying numbers of projected dimensions.
Figure 10.
Benchmark using different dimensionality reduction methods. A) Time in seconds required for dimensionality reduction at various numbers of projected dimensions. B) Time in seconds required for 2D projections using different numbers of frames as data points.
Figure 11.
Demonstration of protein function analysis using the UMAP method. A) UMAP 2D projection. The reduced space was clustered into 16 macrostates according to the criteria presented above. The clusters were color coded based on their population. Dark states are blue, and light states are red. The dashed line represents the division between dark and light areas. Arrows represent the pathway for allosteric conversion from fully dark to fully light states. B) Population state analysis of the macrostates involved in the VVD allosteric process. C) Visualization of the four representative states involved in the allosteric process.
