Performance and robustness of penalized and unpenalized methods for genetic prediction of complex human disease

Genet Epidemiol. 2013 Feb;37(2):184-95. doi: 10.1002/gepi.21698. Epub 2012 Nov 30.


A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic-net penalized support-vector machine models, a mixed-effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false-positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists for subsequent penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome-wide SNP profiles across eight complex diseases within cross-validation, lasso and elastic-net models achieved substantially better predictive ability than the other methods in celiac disease, type 1 diabetes, and Crohn's disease, and had equivalent predictive ability in the rest, with the results in celiac disease strongly replicating between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage, compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing phenotype predictions and variance explained that are as good as or better than those of the other methods. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease.
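
The simulation comparison above rests on two quantities, precision and false-positive rate for recovering causal SNPs. The abstract does not give their formulas, so the following is only a minimal sketch assuming the standard definitions; the function name `selection_metrics` and all SNP indices are hypothetical illustrations, not data or code from the study.

```python
def selection_metrics(selected, causal, n_snps):
    """Return (precision, false-positive rate) for a model's selected SNPs.

    Assumes the standard definitions:
      precision = true positives / number of SNPs selected
      FPR       = false positives / number of non-causal SNPs
    """
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    false_pos = len(selected - causal)
    n_null = n_snps - len(causal)
    precision = true_pos / len(selected) if selected else 0.0
    fpr = false_pos / n_null if n_null else 0.0
    return precision, fpr


# Made-up toy example: 100 SNPs, 3 of them causal (hypothetical indices).
N_SNPS = 100
causal = {3, 17, 42}

# A sparse model that keeps few SNPs vs. a dense one that keeps many.
sparse_selected = {3, 17, 50}
dense_selected = set(range(60))

sparse_precision, sparse_fpr = selection_metrics(sparse_selected, causal, N_SNPS)
dense_precision, dense_fpr = selection_metrics(dense_selected, causal, N_SNPS)

print(f"sparse: precision={sparse_precision:.3f}, FPR={sparse_fpr:.3f}")
print(f"dense:  precision={dense_precision:.3f}, FPR={dense_fpr:.3f}")
```

In this toy case the dense selection recovers all three causal SNPs but at a much higher false-positive rate and lower precision, which is the kind of trade-off these two metrics are meant to expose.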

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Arthritis, Rheumatoid / genetics
  • Case-Control Studies
  • Celiac Disease / genetics
  • Coronary Artery Disease / genetics
  • Crohn Disease / genetics
  • Diabetes Mellitus, Type 1 / genetics
  • Disease / genetics*
  • Genome-Wide Association Study
  • Humans
  • Linkage Disequilibrium
  • Logistic Models
  • Models, Genetic*
  • Multifactorial Inheritance*
  • Polymorphism, Single Nucleotide*
  • Reproducibility of Results