Efficient genome-wide association in biobanks using topic modeling identifies multiple novel disease loci

Mol Med. 2017 Nov:23:285-294. doi: 10.2119/molmed.2017.00100. Epub 2017 Aug 31.

Abstract

Biobanks and national registries represent a powerful tool for genomic discovery, but rely on diagnostic codes that may be unreliable and fail to capture the relationship between related diagnoses. We developed an efficient means of conducting genome-wide association studies using combinations of diagnostic codes from electronic health records (EHR) for 10,845 participants in a biobanking program at two large academic medical centers. Specifically, we applied latent Dirichlet allocation to fit 50 disease topics based on diagnostic codes, then conducted genome-wide common-variant association for each topic. In sensitivity analysis, these results were contrasted with those obtained from traditional single-diagnosis phenome-wide association analysis, as well as those in which only a subset of diagnostic codes was included per topic. In meta-analysis across three biobank cohorts, we identified 23 disease-associated loci with p<1e-15, including previously associated autoimmune disease loci. In all cases, observed significant associations were of greater magnitude than those for single phenome-wide diagnostic codes, and incorporation of less strongly loading diagnostic codes enhanced association. This strategy provides a more efficient means of phenome-wide association in biobanks with coded clinical data.
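
To illustrate the per-topic association step described above, the following is a minimal sketch that regresses a participant-level topic loading on allele dosage at a single variant using ordinary least squares. The abstract does not state the regression model, the covariates, or the software used, so the function name `ols_association`, the choice of an unadjusted linear model, and the toy data are all assumptions made for illustration. A real analysis would typically adjust for covariates such as ancestry and would rely on dedicated association software.

```python
import math


def ols_association(dosages, loadings):
    """Hypothetical helper: simple linear regression of a topic loading
    on allele dosage (0, 1, or 2 copies of the tested allele).

    Returns (beta, standard_error, t_statistic). No covariates are
    included; this is an assumption, not the method used in the paper.
    """
    n = len(dosages)
    if n != len(loadings) or n < 3:
        raise ValueError("need at least 3 paired observations")

    mean_x = sum(dosages) / n
    mean_y = sum(loadings) / n

    # Sum of squares of the predictor; zero means every genotype is identical.
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, loadings))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x

    # Residual variance with n - 2 degrees of freedom (slope and intercept).
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, loadings))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se if se > 0 else float("nan")
    return beta, se, t_stat


# Toy data (invented for illustration): one variant, six participants,
# and each participant's loading on one disease topic.
toy_dosages = [0, 1, 2, 0, 1, 2]
toy_loadings = [0.10, 0.20, 0.30, 0.10, 0.20, 0.40]
beta, se, t_stat = ols_association(toy_dosages, toy_loadings)
print(f"beta={beta:.3f} se={se:.3f} t={t_stat:.2f}")
```

In a genome-wide setting this test would be repeated for every common variant and every one of the fitted topics, which is why a continuous topic loading per participant can serve as the phenotype in place of a single diagnostic code.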

Keywords: ICD9; biobank; cluster analysis; coded clinical data; genetic association; genome-wide association; latent Dirichlet allocation; registry; topic modeling.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Biological Specimen Banks*
  • Disease
  • Genetic Variation
  • Genome-Wide Association Study*
  • Genotype
  • Humans
  • Models, Theoretical