Disease progression subtype discovery from longitudinal EMR data with a majority of missing values and unknown initial time points

AMIA Annu Symp Proc. 2014 Nov 14;2014:709-18. eCollection 2014.


Electronic medical records (EMR) contain longitudinal laboratory data that carry valuable phenotypic information on disease progression across a large collection of patients. These data can potentially be used in medical research or patient care; finding disease progression subtypes is a particularly important application. There are, however, two significant difficulties in using these data for statistical analysis: (a) a large proportion of the data is missing, and (b) patients are in very different stages of disease progression, and the time series have no well-defined start points. We present a Bayesian machine learning model that overcomes these difficulties. The method can use highly incomplete time-series measurements of varying lengths, aligns similar trajectories that are in different phases, and is capable of finding consistent disease progression subtypes. We demonstrate the method by finding chronic kidney disease progression subtypes.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adult
  • Aged
  • Artificial Intelligence*
  • Bayes Theorem
  • Disease Progression*
  • Electronic Health Records*
  • Female
  • Glomerular Filtration Rate
  • Humans
  • Information Storage and Retrieval
  • International Classification of Diseases
  • Male
  • Middle Aged
  • Renal Insufficiency, Chronic* / physiopathology

Grants and funding