Using pre-training and interaction modeling for ancestry-specific disease prediction using multiomics data from the UK Biobank

PLoS One. 2025 Dec 1;20(12):e0336861. doi: 10.1371/journal.pone.0336861. eCollection 2025.

Abstract

Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but individuals of non-European descent remain under-represented in them, a critical gap in genetic research. Prediction models trained primarily on individuals of European ancestry often fail to generalize to diverse populations, leading to reduced accuracy and potential health disparities. Here, we assess whether incorporating interaction modeling and pre-training into disease prediction models can improve performance. We evaluated Group-LASSO INTERaction-NET (glinternet) and the pretrained lasso for disease prediction across diverse ancestries in the UK Biobank. Models were trained on multiomic data from White British individuals and individuals of other ancestries and validated in a cohort of more than 96,000 individuals for 8 diseases. Of the 96 trained models, we report 16 with statistically significant incremental predictive performance in terms of ROC-AUC scores ([Formula: see text]), found for diabetes, arthritis, gallstones, cystitis, asthma, and osteoarthritis. Our findings suggest that interaction terms and pre-training can modestly improve prediction accuracy, but these effects are not consistent across all diseases. Our code is available at https://github.com/rivas-lab/AncestryOmicsUKB.
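
As an illustration of what "incremental predictive performance in terms of ROC-AUC" can mean, the following minimal sketch computes the ROC-AUC of two sets of risk scores on the same individuals and takes their difference. It is not the authors' pipeline: the function names `roc_auc` and `incremental_auc` are hypothetical, the toy labels and scores are made up, and the significance test used in the study is not given in the abstract and is not reproduced here.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (case, control) pairs ranked correctly.

    Equivalent to the Mann-Whitney U statistic; tied scores count as half.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def incremental_auc(labels, baseline_scores, augmented_scores):
    """AUC gain of an augmented model over a baseline on the same individuals."""
    return roc_auc(labels, augmented_scores) - roc_auc(labels, baseline_scores)


# Toy data, invented for illustration only (not from the study).
labels = [1, 0, 1, 0, 1, 0]
baseline = [0.6, 0.5, 0.4, 0.3, 0.7, 0.65]   # e.g. a model without interaction terms
augmented = [0.7, 0.4, 0.6, 0.3, 0.8, 0.5]   # e.g. a model with interaction terms
delta = incremental_auc(labels, baseline, augmented)
print(f"incremental ROC-AUC: {delta:.3f}")
```

The pairwise loop is quadratic in cohort size and is meant only to make the definition explicit; for a validation cohort of the size described above, a rank-based implementation would be the practical choice.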

MeSH terms

  • Biological Specimen Banks
  • Genetic Predisposition to Disease
  • Genome-Wide Association Study
  • Humans
  • Multiomics
  • UK Biobank
  • United Kingdom
  • White People / genetics