A Representativeness-informed Model for Research Record Selection from Electronic Medical Record Systems

AMIA Annu Symp Proc. 2023 Apr 29;2022:259-268. eCollection 2022.


Scientific and clinical studies have a long history of bias in the recruitment of underprivileged and minority populations. This underrepresentation leads to inaccurate, inapplicable, and non-generalizable results. Electronic medical record (EMR) systems, which now drive much research, often poorly represent these groups. We introduce a method for quantifying representativeness using information-theoretic measures, together with an algorithmic approach that selects a more representative record cohort than random selection when resource limitations preclude researchers from reviewing every record in the database. We apply this method to select cohorts of 2,000 to 20,000 records from a large (2M+ records) EMR database at the Vanderbilt University Medical Center and assess representativeness based on age, ethnicity, race, and gender. Compared to random selection, which will on average mirror the EMR database demographics, we find that a representativeness-informed approach can compose a cohort of records that is approximately 5.8 times more representative.
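The abstract does not say which information-theoretic measure or selection algorithm the authors use, so the following is only an illustrative sketch under stated assumptions. It assumes representativeness is scored as the Kullback-Leibler divergence from a target population distribution to the cohort's demographic distribution (lower is more representative), and it uses a simple greedy selector. The names `kl_divergence`, `distribution`, and `greedy_select`, the smoothing constant `eps`, and the toy target distribution are all hypothetical. In practice, each record's group label could be a tuple such as (age band, ethnicity, race, gender).

```python
import math
from collections import Counter


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats. Groups missing from q are floored at eps (an assumption)."""
    return sum(
        pv * math.log(pv / max(q.get(group, 0.0), eps))
        for group, pv in p.items()
        if pv > 0
    )


def distribution(groups):
    """Empirical distribution over demographic group labels."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def greedy_select(groups, target, k):
    """Pick k records one at a time, each time taking a record from the group
    whose addition gives the lowest KL(target || cohort). Returns group counts."""
    pool = Counter(groups)   # records still available, per group
    chosen = Counter()       # records selected so far, per group
    for _ in range(k):
        best_group, best_score = None, math.inf
        for group, available in pool.items():
            if available == 0:
                continue
            chosen[group] += 1  # tentatively add one record from this group
            n = sum(chosen.values())
            score = kl_divergence(target, {g: c / n for g, c in chosen.items()})
            chosen[group] -= 1  # undo the tentative addition
            if score < best_score:
                best_group, best_score = group, score
        if best_group is None:  # pool exhausted
            break
        chosen[best_group] += 1
        pool[best_group] -= 1
    return chosen


# Toy example: the database is 90% group "A", but the target population is 50/50.
database = ["A"] * 90 + ["B"] * 10
target = {"A": 0.5, "B": 0.5}
cohort = greedy_select(database, target, k=10)
database_score = kl_divergence(target, distribution(database))
cohort_score = kl_divergence(target, distribution(cohort.elements()))
```

In this toy setting a random sample would, on average, inherit the database's 90/10 skew, while the greedy cohort is balanced and scores a lower divergence from the target. A greedy search of this kind is cheap but is not guaranteed to find the optimal cohort when many intersecting attributes are involved.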

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, N.I.H., Extramural

MeSH terms

  • Data Management*
  • Databases, Factual
  • Electronic Health Records*
  • Humans
  • Software