Nat Biotechnol. 2017 Feb;35(2):128-135.
doi: 10.1038/nbt.3769. Epub 2017 Jan 16.

Mutation effects predicted from sequence co-variation

Thomas A Hopf et al. Nat Biotechnol. 2017 Feb.

Abstract

Many high-throughput experimental technologies have been developed to assess the effects of large numbers of mutations (variation) on phenotypes. However, designing functional assays for these methods is challenging, and systematic testing of all combinations is impossible, so robust methods to predict the effects of genetic variation are needed. Most prediction methods exploit evolutionary sequence conservation but do not consider the interdependencies of residues or bases. We present EVmutation, an unsupervised statistical method for predicting the effects of mutations that explicitly captures residue dependencies between positions. We validate EVmutation by comparing its predictions with outcomes of high-throughput mutagenesis experiments and measurements of human disease mutations and show that it outperforms methods that do not account for epistasis. EVmutation can be used to assess the quantitative effects of mutations in genes of any organism. We provide pre-computed predictions for ∼7,000 human proteins at http://evmutation.org/.

Conflict of interest statement

Competing Financial Interests Statement

The authors declare no competing financial interests.

Figures

Figure 1. Inferring context-dependent effects of mutations from sequences
Evolution has generated diverse families of proteins and RNAs with varied sequences that perform a common function. An unsupervised probabilistic model trained to generate the natural diversity in a multiple sequence alignment of a family can be used to predict the relative favorability of unseen mutations. Left: Existing models describe functional constraints on each position i in a sequence σ independently, averaging over the effect of background positions j. This can lead to incorrect predictions of neutrality. Right: Our approach infers a global probability model with pairwise interactions between positions i and j (Jij, see Methods) as well as background biases at single positions (hi). For a more detailed graphical schematic of the calculation, see Supplementary Fig. 1.
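The difference between the two model classes in Figure 1 can be illustrated with a minimal sketch. It assumes a statistical energy of the form E(σ) = Σ_i h_i(σ_i) + Σ_{i<j} J_ij(σ_i, σ_j), with higher energy meaning a more favorable sequence, and scores a mutation as ΔE = E(mutant) − E(wild type). The function names, the two-position toy alphabet, and all parameter values are hypothetical and chosen only for illustration; real parameters would be inferred from a multiple sequence alignment as described in the Methods.

```python
import numpy as np

def statistical_energy(seq, h, J):
    """Sum of single-site terms h[i, s_i] and pairwise terms J[i, j, s_i, s_j] for i < j."""
    length = len(seq)
    energy = sum(h[i, seq[i]] for i in range(length))
    for i in range(length):
        for j in range(i + 1, length):
            energy += J[i, j, seq[i], seq[j]]
    return float(energy)

def delta_e(wild_type, position, new_state, h, J):
    """Energy of the single mutant minus energy of the wild type."""
    mutant = list(wild_type)
    mutant[position] = new_state
    return statistical_energy(mutant, h, J) - statistical_energy(wild_type, h, J)

# Toy model (made-up values): 2 positions, 2 states per position.
# The pair term rewards matching states (0,0) or (1,1); the site terms are flat.
h = np.zeros((2, 2))
J = np.zeros((2, 2, 2, 2))
J[0, 1, 0, 0] = 1.0
J[0, 1, 1, 1] = 1.0

wild_type = [0, 0]

# Pairwise (epistatic) model: the mutation breaks the favorable pair.
epistatic = delta_e(wild_type, 0, 1, h, J)

# Independent model: same site terms, couplings set to zero.
independent = delta_e(wild_type, 0, 1, h, np.zeros_like(J))

print("epistatic dE:", epistatic)
print("independent dE:", independent)
```

In this toy case the independent model scores the substitution as neutral, while the pairwise model penalizes it because of the background state at the other position. This is the kind of incorrect prediction of neutrality the caption refers to.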
Figure 2. Saturation mutagenesis experiments provide a quantitative test of context-dependent predictions
The computed ΔE mutational landscape of the DNA methyltransferase M.HaeIII (left, colour range from 5th percentile to 0) agrees quantitatively with experimental measurements of M.HaeIII fitness under selection for cleaving activity of a restriction enzyme (right, ρ=0.69, N=1634; marginal distributions in orange). The average mutational sensitivity per position shows improved correlation beyond individual effects (ρ=0.80, N=304).
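The two comparisons in this caption, per-mutation rank correlation and correlation of per-position averages, can be sketched as follows. The arrays are invented toy values, not data from the study, and the helper names are hypothetical. Spearman's ρ is computed here as the Pearson correlation of tie-averaged ranks.

```python
import numpy as np

def rank_with_ties(values):
    """1-based ranks; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        end = start
        while end + 1 < len(values) and sorted_vals[end + 1] == sorted_vals[start]:
            end += 1
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    return float(np.corrcoef(rank_with_ties(x), rank_with_ties(y))[0, 1])

def site_averages(positions, scores):
    """Mean score over all mutations at each position, in sorted position order."""
    positions = np.asarray(positions)
    scores = np.asarray(scores, dtype=float)
    return np.array([scores[positions == p].mean() for p in np.unique(positions)])

# Toy data (made up): two mutations at each of three positions.
positions = [1, 1, 2, 2, 3, 3]
predicted_delta_e = [-4.0, -3.0, -1.0, -0.5, -6.0, -5.0]
measured_fitness = [0.2, 0.4, 0.9, 0.8, 0.1, 0.05]

mutation_rho = spearman_rho(predicted_delta_e, measured_fitness)
site_rho = spearman_rho(
    site_averages(positions, predicted_delta_e),
    site_averages(positions, measured_fitness),
)
print("per-mutation rho:", mutation_rho)
print("per-site rho:", site_rho)
```

Averaging over the substitutions at a position smooths out disagreements between individual mutations, which is one plausible reason a per-position correlation can exceed the per-mutation one.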
Figure 3. ΔE captures experimental fitness landscapes and identifies deleterious human variants
(a) Computed effects of specific mutations (difference in evolutionary statistical energy ΔE) based on the epistatic model agree with diverse experimental measurements of fitness and molecular function for 34 experiments for 20 proteins, a protein complex and an RNA molecule (underlined) as measured by Spearman’s rank correlation coefficient ρ (for equivalent site average plot, see Supplementary Figure 2; for correlations across all different assays tested in the experiments, see Supplementary Fig. 4). (b) Evolutionary statistical energies ΔE distinguish human disease-associated variants from common alleles in the population. This separation increases with the minimum allele frequency (AF) of the variants assumed to be neutral (area under the ROC curve (AUC)=0.92 for AF≥0.1, AUC=0.94 for AF≥0.25, AUC=0.96 for AF≥0.5). (c) The epistatic model shows stronger agreement with experiments than the established methods SIFT and PolyPhen-2, a baseline model based on the BLOSUM62 substitution matrix, and a corresponding independent model without pairwise interactions (differences of ρ of more than 0.6 were included in the bin at 0.6).
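The AUC comparison in panel (b) can be sketched as below, under the assumption that more negative ΔE indicates a more damaging variant. The AUC is computed as the probability that a randomly chosen disease-associated variant has a lower ΔE than a randomly chosen neutral variant, with ties counted as one half. All scores and allele frequencies are invented toy values, and the function names are hypothetical.

```python
def roc_auc(disease_scores, neutral_scores):
    """Fraction of (disease, neutral) pairs in which the disease variant scores lower."""
    wins = 0.0
    for d in disease_scores:
        for n in neutral_scores:
            if d < n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(disease_scores) * len(neutral_scores))

def neutral_at_threshold(population_variants, min_af):
    """Keep delta-E scores of population variants with allele frequency >= min_af."""
    return [score for score, af in population_variants if af >= min_af]

# Toy data (made up): delta-E for disease variants, and (delta-E, allele frequency)
# for population variants treated as neutral.
disease = [-8.0, -6.0, -5.0, -2.0]
population = [(-3.0, 0.12), (-1.0, 0.30), (-0.5, 0.60)]

auc_af_010 = roc_auc(disease, neutral_at_threshold(population, 0.10))
auc_af_025 = roc_auc(disease, neutral_at_threshold(population, 0.25))
print("AUC, AF >= 0.10:", auc_af_010)
print("AUC, AF >= 0.25:", auc_af_025)
```

Raising the allele-frequency threshold removes rarer population variants from the neutral set, so the remaining set is more likely to be truly benign, consistent with the trend the caption reports.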
Figure 4. Improvements of the epistatic model for functional sites
(a) Left: The RNA-binding residue H172 of PolyA-binding protein (PABP) is strongly coupled to other residues in the binding interface that are close in 3D. Right: The epistatic coupling leads to strong constraints on acceptable amino acids in position 172, as observed in an experimental mutation scan of PABP. Only the epistatic model correctly identifies these co-constraints, while a model without sequence context (independent model) suggests many more substitutions would be acceptable (range of experimental preferences scaled to range of predicted preferences based on full set of mutants for entire domain). (b) Positions in PABP for which prediction accuracy improves the most by considering epistasis (≥2σ difference in root mean squared prediction error, spheres) cluster around the RNA ligand (yellow sticks, PDB: 4f02). (c) For seven high-throughput datasets where the correlation ρ of the epistatic and independent models differs by more than 0.05, the epistatic model is more accurate overall (1st column), specifically for the effects of mutations of residues in interaction and ligand-binding sites (2nd column), where the residue mutation is rare versus frequent in the evolutionary sequence alignment (3rd and 4th columns), and where the residue change is damaging versus neutral in the experiment (5th and 6th columns) (Methods and Supplementary Table 8).
Figure 5. Computational predictions complement experimental measurements
Various molecular phenotypes (center) such as structure, thermostability, activity, and ligand-binding affinity are determined by genotype and contribute to fitness in a complicated manner that is not known a priori. However, the distribution of contemporary genotypes (left) provides a record of historical fitness values (right) which can roughly be inferred by computational methods. Identifying those phenotypes that connect to inferred fitness may shed light on which molecular phenotypes have historically been the most relevant to the organism.
