Comparative Study
BMC Genomics. 2013;14 Suppl 3(Suppl 3):S3.
doi: 10.1186/1471-2164-14-S3-S3. Epub 2013 May 28.

Identifying Mendelian Disease Genes With the Variant Effect Scoring Tool

Free PMC article

Hannah Carter et al. BMC Genomics. 2013.

Abstract

Background: Whole exome sequencing studies identify hundreds to thousands of rare protein coding variants of ambiguous significance for human health. Computational tools are needed to accelerate the identification of specific variants and genes that contribute to human disease.

Results: We have developed the Variant Effect Scoring Tool (VEST), a supervised machine learning-based classifier, to prioritize rare missense variants with likely involvement in human disease. The VEST classifier training set comprised ~45,000 disease mutations from the latest Human Gene Mutation Database release and another ~45,000 high-frequency (allele frequency >1%) putatively neutral missense variants from the Exome Sequencing Project. VEST outperforms some of the most popular methods for prioritizing missense variants in carefully designed holdout benchmarking experiments (VEST ROC AUC = 0.91, PolyPhen2 ROC AUC = 0.86, SIFT4.0 ROC AUC = 0.84). VEST estimates variant score p-values against a null distribution of VEST scores for neutral variants not included in the VEST training set. These p-values can be aggregated at the gene level across multiple disease exomes to rank genes for probable disease involvement. We tested the ability of an aggregate VEST gene score to identify candidate Mendelian disease genes, based on whole-exome sequencing of a small number of disease cases. We used whole-exome data for two Mendelian disorders for which the causal gene is known. Considering only genes that contained variants in all cases, the VEST gene score ranked dihydroorotate dehydrogenase (DHODH) number 2 of 2253 genes in four cases of Miller syndrome, and myosin-3 (MYH3) number 2 of 2313 genes in three cases of Freeman-Sheldon syndrome.

Conclusions: Our results demonstrate the potential power gain of aggregating bioinformatics variant scores into gene-level scores and the general utility of bioinformatics in assisting the search for disease genes in large-scale exome sequencing studies. VEST is available as a stand-alone software package at http://wiki.chasmsoftware.org and is hosted by the CRAVAT web server at http://www.cravat.us.
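The two scoring steps described in the Results (a variant p-value taken against a null distribution of neutral scores, then aggregation of p-values per gene) can be illustrated with a minimal Python sketch. This is not the VEST implementation: the function names, the +1 pseudocount, the toy null scores, and the choice of Fisher's method for aggregation are assumptions made for illustration (Fisher's method is one of the two combination methods named in the Figure 3 caption).

```python
import bisect
import math


def empirical_p_value(score, sorted_null_scores):
    """Fraction of null (neutral) scores that are >= score.

    Assumption: a +1 pseudocount is used so the p-value is never exactly 0.
    `sorted_null_scores` must be sorted in ascending order.
    """
    n = len(sorted_null_scores)
    first_at_least = bisect.bisect_left(sorted_null_scores, score)
    n_at_least = n - first_at_least
    return (n_at_least + 1) / (n + 1)


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k df under the null.

    For even degrees of freedom the chi-square survival function has the
    closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no SciPy is needed.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical toy data: neutral scores and the scores of variants
# observed in one gene across several exomes.
null_scores = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.44, 0.58, 0.73, 0.91])
gene_variant_scores = [0.88, 0.95, 0.67]

variant_p = [empirical_p_value(s, null_scores) for s in gene_variant_scores]
gene_p = fisher_combined_p(variant_p)
```

Genes could then be ranked by `gene_p`, smallest first.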

Figures

Figure 1
VEST Classifier performance. Receiver Operating Characteristic (left) and precision-recall curve (right) for VEST were constructed using 5-fold gene holdout cross validation on the VEST training set. The AUC statistics for these two curves were both 0.92 indicating that the VEST classifier has good sensitivity and specificity for identifying mutations with functional consequences for protein activity.
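Gene holdout cross validation, as named in the Figure 1 caption, keeps every variant from a given gene in the same fold, so a classifier is never tested on a gene it saw during training. The sketch below shows one way such a split could be built; the function names and the round-robin assignment of shuffled genes to folds are assumptions, not details taken from the paper.

```python
import random


def gene_holdout_folds(variant_genes, n_folds=5, seed=0):
    """Assign each variant to a fold such that all variants of a gene share a fold.

    variant_genes: list of gene symbols, one entry per variant.
    Returns a list of fold indices (0 .. n_folds - 1) aligned with the input.
    """
    genes = sorted(set(variant_genes))
    random.Random(seed).shuffle(genes)
    # Round-robin over shuffled genes: balances gene counts, not variant counts.
    fold_of_gene = {gene: i % n_folds for i, gene in enumerate(genes)}
    return [fold_of_gene[gene] for gene in variant_genes]


def train_test_indices(folds, held_out_fold):
    """Split variant indices into training and held-out test sets for one fold."""
    train = [i for i, f in enumerate(folds) if f != held_out_fold]
    test = [i for i, f in enumerate(folds) if f == held_out_fold]
    return train, test
```

A gene with many variants makes its fold larger than the others; a production split would likely balance folds by variant count as well.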
Figure 2
Comparison of VEST with popular methods PolyPhen2 and SIFT4.0. Receiver Operating Characteristic (left) and precision-recall curve (right) for VEST (A), PolyPhen2 (B) and SIFT4.0 (C). The color bar for SIFT is reversed since a low SIFT score corresponds to positive class prediction. ROC AUC is 0.92, 0.85, 0.84 for VEST, PolyPhen2 and SIFT respectively. PR AUC is 0.88, 0.76, 0.72 for VEST, PolyPhen2 and SIFT respectively.
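The ROC AUC values quoted in the Figure 2 caption can be read as the probability that a randomly chosen disease variant is ranked above a randomly chosen neutral one. A small Python sketch of that rank-based definition follows; the function name and the `higher_is_positive` flag are illustrative assumptions. The flag covers the SIFT convention mentioned in the caption, where a low score indicates the positive class.

```python
def roc_auc(scores, labels, higher_is_positive=True):
    """ROC AUC as P(random positive outranks random negative); ties count 1/2.

    labels: 1 for the positive (disease) class, 0 for the negative (neutral) class.
    Set higher_is_positive=False for tools whose low scores predict the positive class.
    """
    if not higher_is_positive:
        scores = [-s for s in scores]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of variants, which is fine for illustration but too slow for a benchmark of tens of thousands of variants; a sort-based rank computation gives the same value.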
Figure 3
Power to detect disease genes in simulated cases of locus heterogeneity. Estimated power to detect disease genes in the presence of locus heterogeneity when (A) seven, three and one exomes share disease genes; (B) three, two and one exomes share disease genes; (C) ten and one exomes share disease genes; (D) each of four exomes results from a distinct disease gene. In each case, gene p-values acquired using both Fisher's and Stouffer's methods are compared. Power is shown for raw p-values as well as Benjamini-Hochberg adjusted p-values. The height of each bar corresponds to the number of simulations in which the gene received a p-value or adjusted p-value <0.05.
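For readers unfamiliar with the two procedures named in this caption, the following Python sketch shows the textbook forms of Stouffer's method and the Benjamini-Hochberg adjustment. It is a generic illustration under stated assumptions (unweighted Stouffer combination, one-sided p-values strictly between 0 and 1, hypothetical function names), not the code used to produce the figure.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_combined_p(p_values):
    """Unweighted Stouffer's method: Z = sum(z_i) / sqrt(k), z_i = Phi^-1(1 - p_i).

    Each p-value must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    k = len(p_values)
    z = sum(_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values) / k ** 0.5
    return 1.0 - _STD_NORMAL.cdf(z)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values are monotone non-decreasing in the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene would count as detected in a simulation when its raw or adjusted p-value falls below 0.05, matching the threshold stated in the caption.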
Figure 4
Comparison of VEST score distribution for three empirical null models. Density plots created from VEST score distributions for three empirical null models representing neutral human missense variation. Null model mutations were filtered to remove overlap with the VEST training set, then scored with the VEST classifier. The Swissprot-based null shows an enrichment for large VEST scores in the right tail, indicating predicted functional mutations.
Figure 5
Sensitivity of gene score to mutation count and fraction of functional mutations at different effect sizes. Power to detect disease genes was estimated using simulations in R. Mutation counts and fraction of functional mutations were varied at four different effect sizes (0.5, 1.0, 1.5 and 2.0). A distinct plot represents the results of the simulation for each effect size. The legend on the top right shows the fraction of disease mutations simulated in each gene.
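The caption does not define the simulation model, so the following Python sketch is only one plausible reading of such a power analysis, not a reproduction of the authors' R code. It assumes that each mutation contributes a z-score, that neutral mutations are drawn from N(0, 1) and functional ones from N(effect_size, 1), and that the gene score is an unweighted Stouffer combination tested at 0.05. All names and parameters are hypothetical.

```python
import random
from statistics import NormalDist


def estimate_power(n_mutations, frac_functional, effect_size,
                   n_sims=2000, alpha=0.05, seed=0):
    """Monte Carlo estimate of the power to flag a gene at level alpha.

    Assumed model (not from the paper): neutral z ~ N(0, 1),
    functional z ~ N(effect_size, 1), gene Z = sum(z) / sqrt(n_mutations).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    n_functional = round(n_mutations * frac_functional)
    hits = 0
    for _ in range(n_sims):
        z_sum = 0.0
        for j in range(n_mutations):
            shift = effect_size if j < n_functional else 0.0
            z_sum += rng.gauss(shift, 1.0)
        z_gene = z_sum / n_mutations ** 0.5
        p_gene = 1.0 - std_normal.cdf(z_gene)
        if p_gene < alpha:
            hits += 1
    return hits / n_sims
```

Calling `estimate_power` over a grid of mutation counts and functional fractions, once per effect size, would yield curves of the general kind the caption describes.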
Figure 6
Sensitivity of gene score to VEST classification error. Power simulations were repeated with an additional parameter: VEST true positive rate (TPR). Four TPRs were selected based on VEST generalization error estimates. A set of simulations is shown for each of the four points (60%, 70%, 80% and 90%). As expected, power to detect disease genes decreases as the TPR decreases.


