Nucleic Acids Res. 2011 Sep 1;39(17):e118.
doi: 10.1093/nar/gkr407. Epub 2011 Jul 3.

Predicting the Functional Impact of Protein Mutations: Application to Cancer Genomics

Free PMC article

Boris Reva et al. Nucleic Acids Res. 2011.

Abstract

As large-scale re-sequencing of genomes reveals many protein mutations, especially in human cancer tissues, predicting their likely functional impact becomes an important practical goal. Here, we introduce a new functional impact score (FIS) for amino acid residue changes based on evolutionary conservation patterns. The information in these patterns is derived from aligned families and sub-families of sequence homologs within and between species using a combinatorial entropy formalism. The score performs well on a large set of human protein mutations in separating disease-associated variants (∼19 200), assumed to be strongly functional, from common polymorphisms (∼35 600), assumed to be weakly functional (area under the receiver operating characteristic curve of ∼0.86). In cancer, using recurrence, multiplicity and annotation for ∼10 000 mutations in the COSMIC database, the method does well in assigning higher scores to more likely functional mutations ('drivers'). To guide experimental prioritization, we report a list of about 1000 top human cancer genes frequently mutated in one or more cancer types, ranked by likely functional impact, and an additional 1000 candidate cancer genes with rare but likely functional mutations. In addition, we estimate that at least 5% of cancer-relevant mutations involve a switch of function, rather than simply a loss or gain of function.

Figures

Figure 1.
Schematic of the method and validation tests. The functional impact score (FIS) is derived from multiple sequence alignments of sequence homologs. The score is based on the evolutionary conservation of a mutated residue in a protein family and, separately, in each of its subfamilies. Larger scores indicate more likely functional impact of a mutation.
Figure 2.
Separation of disease-associated and polymorphic variants by functional impact score. (A) Normalized, smoothed distributions of the functional impact score computed for 19 179 known ‘disease-associated’ variants and 35 608 ‘common polymorphism’ variants annotated in UniProt (HUMSAVAR, release 2010_08; http://www.uniprot.org/docs/humsavar). (B) The cumulative distributions of the score values computed for disease-associated and polymorphic variants, same data as in (A). An equally balanced separation between the two variant classes is achieved at a score threshold of FIS ∼1.9: at this threshold, ∼79% of all disease-associated variants score higher and ∼79% of all polymorphic variants score lower. The maximal separation (∼80.3%) between the two classes is achieved at a threshold of 2.26; at this threshold, ∼70% of disease-associated variants score higher and ∼86% of polymorphic variants score lower.
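The two thresholds quoted in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy scores and the scanning grid are assumptions made for illustration. It also assumes that "separation" means the overall fraction of correctly classified variants, a reading that is consistent with the quoted numbers (70% and 86% at the reported class sizes give roughly 80%).

```python
def sensitivity_specificity(disease, polymorphic, threshold):
    """Fraction of disease scores above the threshold, and fraction of
    polymorphic scores at or below it."""
    sens = sum(s > threshold for s in disease) / len(disease)
    spec = sum(s <= threshold for s in polymorphic) / len(polymorphic)
    return sens, spec


def overall_accuracy(sens, spec, n_disease, n_polymorphic):
    """Fraction of all variants classified correctly (assumed meaning of 'separation')."""
    return (sens * n_disease + spec * n_polymorphic) / (n_disease + n_polymorphic)


def balanced_threshold(disease, polymorphic, lo=-6.0, hi=6.0, steps=1000):
    """Scan a grid of thresholds and return the first one where sensitivity
    and specificity are closest to each other."""
    best = None
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        sens, spec = sensitivity_specificity(disease, polymorphic, t)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, t, sens, spec)
    return best[1], best[2], best[3]


# Hypothetical toy scores, not data from the paper.
toy_disease = [2.0, 3.0, 4.0, 5.0]
toy_polymorphic = [-1.0, 0.0, 1.0, 1.5]
t, sens, spec = balanced_threshold(toy_disease, toy_polymorphic)

# Consistency check against the caption: ~70% and ~86% at threshold 2.26,
# weighted by the reported class sizes, give about 0.80.
caption_accuracy = overall_accuracy(0.70, 0.86, 19179, 35608)
```

Because the caption's percentages are rounded, the recomputed accuracy only needs to agree with the reported ∼80.3% to within rounding.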
Figure 3.
ROC analysis of classification between disease-associated and polymorphic variants. The observed score range (−6, 6) was divided into 1000 discrete thresholds, and for each threshold the percentages of disease-associated and polymorphic variants above and below the threshold were determined. The percentage of disease-associated variants above the score threshold is defined as ‘true positives’, while the percentage of polymorphic variants above the score threshold is defined as ‘false positives’. The ROC curves are built for two test sets. In the first set, all available ∼55.7 K variants (∼19.2 K disease-associated and ∼36.5 K polymorphic) were used, and the scores of variants that fall in regions with no sequence homology were set to zero. In the second set, the scores for a reduced set of ∼27.4 K variants (∼13.7 K disease-associated and ∼13.6 K polymorphic) were computed using alignments of 75 or more sequences.
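The threshold sweep described in this caption can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function names are hypothetical, the thresholds are assumed to be evenly spaced over the range, and the area under the curve is estimated with the trapezoidal rule, which the caption does not specify.

```python
def roc_points(disease, polymorphic, lo=-6.0, hi=6.0, n_thresholds=1000):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    True positives: disease-associated variants scoring above the threshold.
    False positives: polymorphic variants scoring above the threshold.
    """
    points = []
    for i in range(n_thresholds):
        t = lo + (hi - lo) * i / (n_thresholds - 1)
        tpr = sum(s > t for s in disease) / len(disease)
        fpr = sum(s > t for s in polymorphic) / len(polymorphic)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve, with the (0, 0) and (1, 1)
    end points added explicitly."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy scores, not data from the paper.
separable = auc(roc_points([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0]))
identical = auc(roc_points([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))
```

On the toy inputs, perfectly separated score sets give an area of 1 and identical score sets give 0.5, the two reference values against which the reported ∼0.86 can be read.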
Figure 4.
FIS distributions of mutations in TP53 binned into eight classes based on mutational impact. The normalized transcriptional activities of 2314 TP53 mutants were averaged and, depending on the average activity value, the mutations were binned into eight classes; the ranges of the average transcriptional activity are given below the bin marks. The FIS distributions are shown as box plots: thick black lines mark the medians, each box spans the lower to upper quartile, and the dotted lines extend to the minimum and maximum values. Mutations with larger functional impact, i.e. higher or lower than normal transcriptional activity (‘gain of function’ or ‘loss of function’), tend to have higher FIS values.
Figure 5.
Cumulative score distributions computed for recurrent cancer mutations in the COSMIC database (release 49, September 2010). The scores were computed for 10 005 unique non-synonymous point mutations affecting 3630 genes. Recurrent cancer mutations observed two or more times (1828) and highly recurrent mutations observed five or more times (712) score significantly higher than mutations observed only once (8177). The ROC analysis (not shown) of separation of recurrent mutations from one-time-observed mutations gives AUC = 0.75; the accuracy of separation is ∼69% when the percentage of false positives equals the percentage of false negatives.
Figure 6.
Cumulative score distributions computed for mutations of multiply mutated genes in the COSMIC database. Mutations in COSMIC are distributed non-uniformly across genes: one mutation per gene is detected in 1349 genes, two or more mutations in 620 genes, three or more in 265 genes, five or more in 96 genes, 10 or more in 51 genes, and 19 or more in 37 genes. Multiply mutated genes (mutated two or more times) are enriched in high-scoring mutations compared to singly mutated genes and polymorphisms.
Figure 7.
Cumulative score distributions computed for mutations in genes annotated as tumor suppressors (TS) and oncogenes (OG) in the COSMIC database; 4413 mutations in tumor suppressors and oncogenes are enriched in high-scoring mutations compared to 5592 mutations in genes not annotated as TS or OG. The ROC analysis (not shown) of separation of recurrent mutations from one-time-observed mutations gives AUC = 0.6745; the accuracy of separation is 64% when the percentage of false positives equals the percentage of false negatives.
Figure 8.
Ranking mutated genes by significance for cancer. The cancer gene ranking score (Rs), derived from information reported in the COSMIC database, is defined as Rs = log2(Nm*Nc), where Nm is the number of unique cancer-associated mutations reported in the gene and Nc is the number of different cancer types with mutations in this gene. All 3629 analyzed genes were divided into four categories depending on the presence or absence of predicted functional mutations and on known association with cancer (a gene is considered cancer-associated if it is annotated as TS or OG, or if it interacts with one or more TS or OG). Cancer-associated genes are enriched in predicted functional mutations (P < 10^−20, two-tailed Fisher's exact test) compared to genes with unknown cancer association. Using a reasonable cutoff, one nominates a list of 957 genes with significance for cancer (arrow). A gene is above the cut either because it is observed to be multiply mutated (Rs > 1, three or more mutations) or, for Rs = 1 (two mutations), because at least one of the mutations in the gene is predicted to be functional. Detailed statistical information on mutated genes is in Supplementary Table SM2. The higher proportion of genes with at least one predicted functional mutation (orange or brown) among frequently mutated genes (peak at left) is not surprising; in fact, a fair number of these mutations have been functionally validated in the literature. A particularly interesting set of genes (998, bottom left) comprises those that (so far) have been observed just once (Rs = 0) but contain a mutation predicted to be functional. Such genes may be rare, but functionally significant, contributors to oncogenesis and are good candidates for experimental follow-up.
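The ranking score and the cut described in this caption can be written out as a short Python sketch. The function names and the category labels are hypothetical, chosen only for illustration; the rule itself follows the caption (nominate if Rs > 1, or if Rs = 1 with at least one predicted functional mutation; flag Rs = 0 genes with a predicted functional mutation as rare candidates). The comparison is done on the product Nm*Nc rather than on the logarithm, which is equivalent and avoids floating-point equality tests.

```python
import math


def ranking_score(n_mutations, n_cancer_types):
    """Rs = log2(Nm * Nc), as defined in the caption."""
    return math.log2(n_mutations * n_cancer_types)


def classify_gene(n_mutations, n_cancer_types, has_predicted_functional):
    """Apply the caption's cut to one gene.

    Rs > 1  <=>  Nm * Nc > 2
    Rs = 1  <=>  Nm * Nc == 2
    Rs = 0  <=>  Nm * Nc == 1
    The returned labels are hypothetical names, not terms from the paper.
    """
    product = n_mutations * n_cancer_types
    if product > 2:
        return "nominated"
    if product == 2:
        return "nominated" if has_predicted_functional else "below cut"
    if has_predicted_functional:
        return "rare candidate"
    return "below cut"


# Hypothetical genes given as (Nm, Nc, has a predicted functional mutation).
examples = [(3, 1, False), (2, 1, True), (2, 1, False), (1, 1, True)]
labels = [classify_gene(*gene) for gene in examples]
```

In the caption, the first label corresponds to the list of 957 genes above the cut and the "rare candidate" label to the 998 genes observed once with a predicted functional mutation.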
Figure 9.
Functional mutation in a predicted specificity position of RAC1 (Ras-related C3 botulinum toxin substrate 1). (A) The mutation affects a residue that is conserved as A (Ala) in subfamily #1 (top sequences, close homologues of RAC1) and as E (Glu) in subfamily #2 (bottom sequences, close homologues of CDC42); the UniProt name, species identifier, residue number range and subfamily number are given in the left columns. The sequence subfamilies and specificity scores (vertical bars at top) were computed from a non-redundant multiple sequence alignment (MSA) of 274 sequences using CEO clustering. The mutation A95E of RAC1 has a high specificity score in RAC1. (B) The position affected by the mutation lies in the binding interface of RAC1, in contact with the T-lymphoma invasion and metastasis factor 1 (Tiam1) (PDB code 1foe).
