Using machine learning for predicting cervical cancer from Swedish electronic health records by mining hierarchical representations

PLoS One. 2020 Aug 21;15(8):e0237911. doi: 10.1371/journal.pone.0237911. eCollection 2020.


Electronic health records (EHRs) contain rich documentation of disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction because of its high dimensionality, relative scarcity, and substantial noise. We investigated how best to represent EHR data for predicting cervical cancer, a serious disease for which early detection improves treatment outcome. A case group of 1321 patients with cervical cancer was matched to ten times as many controls, and several types of events were extracted from the EHRs of both groups. These events included clinical codes, lab results, and contents of free-text notes retrieved using an LSTM neural network. Clinical events are described with great variation in EHR texts, which leads to a very large feature space. We therefore created an event hierarchy, inferred from the textual events, to represent the clinical texts. Overall, the events extracted from free-text notes contributed the most to the final prediction, and the hierarchy of textual events further improved performance. Four classifiers were evaluated for predicting a future cancer diagnosis. Random Forest achieved the best results, with an AUC ranging from 0.70 one year before diagnosis to 0.97 one day before diagnosis. We conclude that our approach is sound and had excellent discrimination at diagnosis, but only modest discrimination before that point. Since our objective was to predict the disease earlier than this, we propose that further work consider extending patient histories, e.g. by integrating primary health records that precede referral to hospital.
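As a rough illustration of the two ideas the abstract relies on, the sketch below rolls fine-grained textual events up a hierarchy so that differently worded events share coarser features, and computes AUC as the probability that a randomly chosen case is scored above a randomly chosen control. It is a minimal sketch under stated assumptions, not the authors' implementation: the `PARENT` map, the event strings, and the function names are hypothetical, and the hierarchy here is hand-written, whereas the study inferred its hierarchy from the textual events.

```python
from collections import Counter

# Hypothetical child -> parent map, hand-written for illustration only.
# The study inferred its hierarchy from the textual events themselves.
PARENT = {
    "heavy vaginal bleeding": "vaginal bleeding",
    "postcoital bleeding": "vaginal bleeding",
    "vaginal bleeding": "bleeding",
    "lower abdominal pain": "abdominal pain",
    "abdominal pain": "pain",
}


def with_ancestors(event, parent=PARENT):
    """Return the event followed by all of its ancestors in the hierarchy."""
    chain = [event]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def hierarchical_features(events, parent=PARENT):
    """Count each event, and each of its ancestors, once per occurrence."""
    counts = Counter()
    for event in events:
        counts.update(with_ancestors(event, parent))
    return counts


def auc(scores, labels):
    """AUC as P(case score > control score); ties count as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


if __name__ == "__main__":
    patient_events = [
        "heavy vaginal bleeding",
        "postcoital bleeding",
        "lower abdominal pain",
    ]
    print(hierarchical_features(patient_events))
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

In this toy input the two differently worded bleeding events both contribute to the shared "vaginal bleeding" and "bleeding" features, which is the kind of reduction in feature sparsity that a hierarchy is meant to provide.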

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Mining
  • Electronic Health Records*
  • Female
  • Humans
  • Machine Learning*
  • Neural Networks, Computer
  • Sweden
  • Uterine Cervical Neoplasms / diagnosis*

Grants and funding

Both authors (RW and KS) were funded by the Nordic Information for Action eScience Center of Excellence in Health-Related e-Sciences (NIASC, project number 62721). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.