On the sparsity of fitness functions and implications for learning
- PMID: 34937698
- PMCID: PMC8740588
- DOI: 10.1073/pnas.2109649118
Abstract
Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model's interpretable parameters (sequence length, alphabet size, and assumed interactions between sequence positions) on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
Keywords: compressed sensing; epistasis; fitness functions; protein structure.
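The link between assumed interactions and sparsity described in the abstract can be illustrated with a small numerical sketch. This is not the authors' code. It assumes a binary alphabet, a plain NK model with cyclic adjacent neighborhoods of size K (counting the position itself), and the Walsh-Hadamard basis as the epistatic representation; the values of `L` and `K` and all variable names are hypothetical choices made for illustration.

```python
import itertools
import numpy as np

L, K = 8, 3  # sequence length and neighborhood size (illustrative values)
rng = np.random.default_rng(0)

# Assumed neighborhood scheme: each position interacts with itself and the
# next K-1 positions, wrapping around cyclically.
neighborhoods = [[(i + j) % L for j in range(K)] for i in range(L)]

# All 2**L binary sequences; row r is the binary expansion of r with
# position 0 as the most significant bit.
seqs = np.array(list(itertools.product([0, 1], repeat=L)))

# NK fitness: one random lookup table per position, indexed by the bits
# found at that position's neighborhood, summed over positions.
fitness = np.zeros(len(seqs))
for nbhd in neighborhoods:
    table = rng.standard_normal(2 ** K)
    idx = seqs[:, nbhd] @ (2 ** np.arange(K))
    fitness += table[idx]


def walsh_hadamard(values):
    """Normalized fast Walsh-Hadamard transform (natural ordering)."""
    a = np.asarray(values, dtype=float).copy()
    n, h = len(a), 1
    while h < n:
        for start in range(0, n, 2 * h):
            x = a[start:start + h].copy()
            y = a[start + h:start + 2 * h].copy()
            a[start:start + h] = x + y
            a[start + h:start + 2 * h] = x - y
        h *= 2
    return a / n


coeffs = walsh_hadamard(fitness)

# Coefficient k describes the interaction among the positions whose bits
# are set in k (position p corresponds to bit L-1-p).
observed = {int(k) for k in np.flatnonzero(np.abs(coeffs) > 1e-9)}

# Predicted support: every subset of positions contained in a neighborhood.
support = set()
for nbhd in neighborhoods:
    for r in range(K + 1):
        for subset in itertools.combinations(nbhd, r):
            support.add(sum(1 << (L - 1 - p) for p in subset))

print(f"nonzero coefficients: {len(observed)} of {2 ** L}")
print(f"matches neighborhood-based prediction: {observed == support}")
```

Under these assumptions, the nonzero epistatic coefficients are exactly those whose positions all fall inside a single neighborhood, so the sparsity is determined by the interaction structure rather than by the random table entries. Compressed sensing guarantees are commonly stated with the number of measurements growing roughly in proportion to S log N for an S-sparse function over N sequences, with a constant this sketch does not estimate.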
Copyright © 2021 the Author(s). Published by PNAS.
Conflict of interest statement
Competing interest statement: J.L. is on the Scientific Advisory Board for Foresite Labs and Patch Biosciences.
