Imputing genotypes using regularized generalized linear regression models

Stat Appl Genet Mol Biol. 2014 Oct;13(5):519-29. doi: 10.1515/sagmb-2012-0044.


As genomic sequencing technologies continue to advance, researchers are furthering their understanding of the relationships between genetic variants and expressed traits. However, missing data can significantly limit the power of a genetic study. Here, the use of a regularized generalized linear model, denoted by GLMNET, is proposed to impute missing genotypes. The method aims to address certain limitations of earlier regression approaches to genotype imputation, particularly the need to specify the number of neighboring SNPs to include when imputing a missing genotype. The performance of the GLMNET-based method is compared to that of the conventional multinomial regression method and two phase-based methods: fastPHASE and BEAGLE. Two simulation scenarios are evaluated: a sparse-missing model and a small-panel expansion model. The sparse-missing model simulates a scenario in which SNP genotypes are missing in a random fashion across the genome. In the small-panel expansion model, a set of individuals is genotyped at only a subset of the SNPs of the large panel. Each imputation method is tested on two datasets: Canadian Holstein cattle data and human HapMap CEU data. Results show that the proposed GLMNET method outperforms the other methods in the small-panel expansion scenario, whereas fastPHASE performs slightly better than the GLMNET method in the sparse-missing scenario.
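
The two simulation scenarios described above can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the 0/1/2 genotype coding, the 5% missing rate, the matrix dimensions, and the helper names (`sparse_missing_mask`, `small_panel_mask`, `imputation_accuracy`) are all assumptions made for illustration. The sketch only builds the missingness patterns and scores an imputed matrix; the imputation model itself is left out.

```python
import random

MISSING = None  # placeholder for a masked genotype (assumed convention)

def sparse_missing_mask(genotypes, missing_rate, rng):
    """Sparse-missing scenario: mask each genotype independently at random."""
    return [[MISSING if rng.random() < missing_rate else g for g in row]
            for row in genotypes]

def small_panel_mask(genotypes, study_rows, panel_cols):
    """Small-panel expansion scenario: individuals in study_rows keep only
    the SNPs in panel_cols; all other individuals stay fully genotyped."""
    study, panel = set(study_rows), set(panel_cols)
    masked = []
    for i, row in enumerate(genotypes):
        if i in study:
            masked.append([g if j in panel else MISSING
                           for j, g in enumerate(row)])
        else:
            masked.append(list(row))
    return masked

def imputation_accuracy(truth, masked, imputed):
    """Fraction of masked entries whose imputed value equals the true genotype."""
    hits = total = 0
    for t_row, m_row, i_row in zip(truth, masked, imputed):
        for t, m, i in zip(t_row, m_row, i_row):
            if m is MISSING:
                total += 1
                hits += int(t == i)
    return hits / total if total else float("nan")

# Toy data: 10 individuals x 20 SNPs, genotypes coded 0/1/2 (assumed coding).
rng = random.Random(0)
genotypes = [[rng.choice([0, 1, 2]) for _ in range(20)] for _ in range(10)]

sparse = sparse_missing_mask(genotypes, missing_rate=0.05, rng=rng)
panel = small_panel_mask(genotypes, study_rows=range(5),
                         panel_cols=range(0, 20, 4))
```

Any imputation method can then be scored by filling the masked entries and passing the result to `imputation_accuracy` together with the true and masked matrices.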

MeSH terms

  • Genotype*
  • Linear Models
  • Models, Theoretical*