Identifying Cases of Metastatic Prostate Cancer Using Machine Learning on Electronic Health Records

AMIA Annu Symp Proc. 2018 Dec 5;2018:1498-1504. eCollection 2018.


Cancer stage is rarely captured in structured form in the electronic health record (EHR). We evaluate the performance of a classifier, trained on structured EHR data, in identifying prostate cancer patients with metastatic disease. Using EHR data for a cohort of 5,861 prostate cancer patients mapped to the Observational Health Data Sciences and Informatics (OHDSI) data model, we constructed feature vectors containing frequency counts of conditions, procedures, medications, observations and laboratory values. Staging information from the California Cancer Registry was used as the ground truth. For identifying patients with metastatic disease, a random forest model achieved a precision of 0.90 and a recall of 0.40 using data within 12 months of diagnosis. By comparison, an ICD code-based query achieved a precision of 0.33 and a recall of 0.54. High-precision classifiers using hundreds of structured data elements significantly outperform ICD queries on precision, and may assist in identifying cohorts for observational research or clinical trial matching.
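As an illustration only, the following minimal Python sketch shows one way frequency-count feature vectors and the precision and recall metrics reported above could be computed. It is not the authors' pipeline: the function names, the concept labels, and the toy events are hypothetical, and no classifier is trained here. In practice the resulting vectors would be passed to a model such as a random forest.

```python
from collections import Counter, defaultdict


def build_count_vectors(events, vocabulary):
    """Turn (patient_id, concept) event pairs into per-patient count vectors.

    Each vector holds, in vocabulary order, how many times a concept
    (condition, procedure, medication, ...) appears for that patient.
    """
    counts = defaultdict(Counter)
    for patient_id, concept in events:
        counts[patient_id][concept] += 1
    return {
        patient_id: [concept_counts[concept] for concept in vocabulary]
        for patient_id, concept_counts in counts.items()
    }


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = metastatic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data; concept labels are invented for illustration.
vocabulary = [
    "condition:bone_metastasis",
    "drug:docetaxel",
    "procedure:prostatectomy",
]
events = [
    ("p1", "condition:bone_metastasis"),
    ("p1", "drug:docetaxel"),
    ("p1", "drug:docetaxel"),
    ("p2", "procedure:prostatectomy"),
]
vectors = build_count_vectors(events, vocabulary)

# Registry labels (ground truth) versus hypothetical model predictions.
registry_labels = [1, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 1]
precision, recall = precision_recall(registry_labels, predicted_labels)
```

Precision is the fraction of patients flagged as metastatic who truly are, and recall is the fraction of truly metastatic patients who are flagged, which is why a classifier can have higher precision but lower recall than an ICD code-based query.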

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • California
  • Cohort Studies
  • Electronic Health Records*
  • Humans
  • Information Storage and Retrieval / methods
  • International Classification of Diseases
  • Machine Learning*
  • Male
  • Medical Informatics
  • Neoplasm Metastasis / diagnosis
  • Neoplasm Staging / methods*
  • Proof of Concept Study
  • Prostatic Neoplasms / pathology*