Computational predictions have become indispensable for evaluating the disease-related impact of nonsynonymous single-nucleotide variants discovered in exome sequencing. Many such methods have their roots in molecular evolution, as they use information derived from multiple sequence alignments. We show that the performance of current methods (e.g., PolyPhen-2 and SIFT) is improved significantly by optimizing their statistical models on evolutionarily balanced training data, in which each evolutionary conservation class contains equal numbers of positive and negative controls. Evolutionary balancing significantly reduces the false-positive rates for variants observed at highly conserved sites and the false-negative rates for variants observed at fast-evolving sites. These improved methods enable more accurate forecasting when a concordant diagnosis from multiple methods is regarded as a more reliable prediction. When applied to a large exome variation data set, the current methods produce concordant predictions for fewer than half of the population variants. These advances are implemented in a web resource for use in practical applications (www.mypeg.info, last accessed March 13, 2013).
Keywords: computational prediction; evolutionary medicine; nonsynonymous single-nucleotide variant.
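
The idea of evolutionarily balanced training data can be illustrated with a minimal sketch. It assumes that each training variant already carries a conservation class label and a disease/neutral label, and that balance is reached by randomly downsampling the larger group within each class. The abstract does not specify how the balance is achieved, so the downsampling step, the function name `balance_by_conservation`, the class labels, and the example variants are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_conservation(variants, seed=0):
    """Return a subset with equal positives and negatives per conservation class.

    variants: list of (variant_id, conservation_class, is_disease) tuples.
    Downsampling the larger group is an assumption made for illustration.
    """
    rng = random.Random(seed)

    # Group variants by conservation class, then by label.
    groups = defaultdict(lambda: {True: [], False: []})
    for variant in variants:
        _, cons_class, is_disease = variant
        groups[cons_class][bool(is_disease)].append(variant)

    balanced = []
    for cons_class in sorted(groups):
        positives = groups[cons_class][True]
        negatives = groups[cons_class][False]
        # Keep as many of each label as the smaller group provides.
        n = min(len(positives), len(negatives))
        balanced.extend(rng.sample(positives, n))
        balanced.extend(rng.sample(negatives, n))
    return balanced


# Hypothetical toy data: disease variants dominate the conserved class,
# neutral variants dominate the fast-evolving class.
example = [
    ("v1", "conserved", True),
    ("v2", "conserved", True),
    ("v3", "conserved", True),
    ("v4", "conserved", False),
    ("v5", "fast", True),
    ("v6", "fast", False),
    ("v7", "fast", False),
]

balanced = balance_by_conservation(example)
counts = Counter((cons_class, label) for _, cons_class, label in balanced)
for key in sorted(counts):
    print(key, counts[key])
```

In this sketch a classifier trained on `balanced` can no longer rely on conservation class alone to separate the labels, which is the intuition behind the reported reduction in false positives at conserved sites and false negatives at fast-evolving sites.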