Comparison of mixed model based approaches for correcting for population substructure with application to extreme phenotype sampling

BMC Genomics. 2022 Feb 4;23(1):98. doi: 10.1186/s12864-022-08297-y.


Background: Mixed models are used to correct for confounding due to population stratification and hidden relatedness in genome-wide association studies. This class of models includes linear mixed models and generalized linear mixed models. Mixed model approaches to correcting for population substructure have previously been investigated with both continuous and case-control response variables. However, they have not been investigated in the context of extreme phenotype sampling (EPS), where genetic covariates are collected only on samples with extreme values of the response variable. In this work, we compare the performance of existing binary trait mixed model approaches (GMMAT, LEAP and CARAT) on EPS data. Since linear mixed models are commonly used even with binary traits, we also evaluate the performance of a popular linear mixed model implementation (GEMMA).
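To make the sampling design concrete, the following is a minimal sketch of extreme phenotype sampling on a simulated continuous trait. It is not the study's code: the function name `extreme_phenotype_sample`, the 10% tail fractions, and the 0/1 coding of the two tails are illustrative assumptions, chosen only to show how EPS data can be presented to binary trait methods.

```python
import random


def extreme_phenotype_sample(phenotypes, lower_frac=0.1, upper_frac=0.1):
    """Return (low, high) index lists for the two tails of the phenotype.

    lower_frac and upper_frac are hypothetical tail fractions; a real EPS
    design may instead use fixed phenotype thresholds.
    """
    n = len(phenotypes)
    # Rank individuals from smallest to largest phenotype value.
    order = sorted(range(n), key=lambda i: phenotypes[i])
    n_low = int(n * lower_frac)
    n_high = int(n * upper_frac)
    low = order[:n_low]
    high = order[n - n_high:] if n_high > 0 else []
    return low, high


rng = random.Random(0)
# Simulated continuous trait for 1000 individuals (assumption: standard normal).
phenotypes = [rng.gauss(0.0, 1.0) for _ in range(1000)]

low_idx, high_idx = extreme_phenotype_sample(phenotypes)

# Only the selected individuals would be genotyped. Coding the tails as
# 0 (lower extreme) and 1 (upper extreme) gives a binary trait.
binary_trait = {i: 0 for i in low_idx}
binary_trait.update({i: 1 for i in high_idx})

print(len(binary_trait), "of", len(phenotypes), "individuals selected")
```

Individuals in the middle of the distribution are never genotyped under this design, which is why methods developed for random or case-control sampling may not carry over directly.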

Results: We used simulation studies to estimate the type I error rate and power of all approaches, assuming a population with substructure. Our simulation results show that, for a common candidate variant, both LEAP and GMMAT control the type I error rate, while CARAT's rate remains inflated. We also applied all methods to a real dataset from a Québec, Canada, case-control study that is known to have population substructure, and observed similar type I error control. For rare variants, the false positive rate remains inflated even after correction with mixed model approaches. For methods that control the type I error rate, the estimated power is comparable.
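As a rough illustration of how an empirical type I error rate can be estimated by simulation, the sketch below generates null variants in a population made of two subpopulations and counts how often a test rejects at the 5% level. Everything in it is an assumption for illustration: the allele frequencies, the phenotype means, the correlation test with a normal approximation, and the within-group centering, which is a simple fixed-effect adjustment standing in for the mixed model methods compared in the study. EPS is not applied here.

```python
import math
import random


def correlation_p_value(x, y):
    """Two-sided p-value for zero correlation (large-sample normal approximation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # no variation, so no evidence of association
    z = sxy / math.sqrt(sxx * syy) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def center_within_groups(values, groups):
    """Subtract each group's mean (a fixed-effect adjustment, not a mixed model)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def empirical_type1_error(p_values, alpha=0.05):
    """Fraction of null replicates rejected at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def simulate_null_replicate(rng, n_per_pop=200):
    """Two subpopulations differing in allele frequency and phenotype mean.

    The genotype has no effect on the phenotype, so any rejection is a
    false positive. The (frequency, mean) pairs are hypothetical values.
    """
    groups, genotypes, phenotypes = [], [], []
    for pop, (maf, mean) in enumerate([(0.1, 0.0), (0.5, 1.0)]):
        for _ in range(n_per_pop):
            groups.append(pop)
            genotypes.append(sum(rng.random() < maf for _ in range(2)))
            phenotypes.append(rng.gauss(mean, 1.0))
    return groups, genotypes, phenotypes


rng = random.Random(1)
naive_p, adjusted_p = [], []
for _ in range(200):
    groups, g, y = simulate_null_replicate(rng)
    naive_p.append(correlation_p_value(g, y))
    adjusted_p.append(
        correlation_p_value(
            center_within_groups(g, groups), center_within_groups(y, groups)
        )
    )

naive_rate = empirical_type1_error(naive_p)
adjusted_rate = empirical_type1_error(adjusted_p)
print("unadjusted:", naive_rate, "adjusted:", adjusted_rate)
```

A method is said to control the type I error rate when this fraction stays close to the nominal level; the unadjusted test in this toy setting does not, because subpopulation membership is correlated with both genotype and phenotype.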

Conclusions: The methods compared in this study differ in their type I error control. Therefore, when data are from an EPS study, care should be taken to ensure that the models underlying the methodology are suitable to the sampling strategy and to the minor allele frequency of the candidate SNPs.

Keywords: Extreme phenotype sampling; Generalized linear mixed models; Genome-wide association study; Population stratification; Type 1 error.

MeSH terms

  • Case-Control Studies
  • Computer Simulation
  • Genome-Wide Association Study*
  • Linear Models
  • Models, Genetic*
  • Phenotype
  • Polymorphism, Single Nucleotide