Detecting cardiovascular diseases using unsupervised machine learning clustering based on electronic medical records

BMC Med Res Methodol. 2024 Dec 19;24(1):309. doi: 10.1186/s12874-024-02422-z.

Abstract

Background: Machine learning (ML) models trained on electronic medical records (EMR) have potential for cardiovascular disease (CVD) risk prediction: by integrating a range of patient medical data, they can facilitate timely diagnosis and classification of CVDs. We tested the hypothesis that an unsupervised ML approach using EMR data could be used to develop a new model for detecting prevalent CVD in clinical settings.

Methods: We included 155,894 patients (aged ≥ 18 years) discharged between January 2014 and July 2022 from Xuhui Hospital, Shanghai, China, including 64,916 CVD cases and 90,979 non-CVD cases. K-means clustering was used to generate clustering models with predetermined numbers of clusters (k = 2, 4, and 8). Bayes' theorem was used to estimate the models' predictive accuracy.
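The pipeline described in the Methods can be illustrated with a minimal sketch. The abstract does not state how Bayes' theorem was applied, so the following is an assumption: each cluster is assigned the posterior probability P(CVD | cluster) = P(cluster | CVD) · P(CVD) / P(cluster), and a cluster is called "CVD" when that posterior exceeds 0.5. The data are synthetic two-dimensional points, and all function names (`kmeans`, `cluster_posteriors`, `accuracy`) are hypothetical, not taken from the study.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels

def cluster_posteriors(labels, is_cvd, k):
    """P(CVD | cluster) for each cluster via Bayes' theorem (assumed usage)."""
    n, n_cvd = len(labels), sum(is_cvd)
    prior = n_cvd / n  # P(CVD)
    posteriors = []
    for j in range(k):
        p_cluster = sum(1 for lab in labels if lab == j) / n  # P(cluster)
        if p_cluster == 0 or n_cvd == 0:
            posteriors.append(0.0)
            continue
        # P(cluster | CVD)
        likelihood = sum(1 for lab, y in zip(labels, is_cvd) if lab == j and y) / n_cvd
        posteriors.append(likelihood * prior / p_cluster)
    return posteriors

def accuracy(labels, is_cvd, posteriors):
    """Fraction of records whose cluster-level call matches the true label."""
    hits = sum((posteriors[lab] > 0.5) == bool(y) for lab, y in zip(labels, is_cvd))
    return hits / len(labels)

# Synthetic stand-in for EMR features: two well-separated groups.
rng = random.Random(1)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
is_cvd = [0] * 50 + [1] * 50

results = {}
for k in (2, 4, 8):  # the cluster counts named in the Methods
    labels = kmeans(points, k)
    results[k] = accuracy(labels, is_cvd, cluster_posteriors(labels, is_cvd, k))
    print(k, results[k])
```

Note that the diagnosis labels are used only after clustering, to score the clusters; the clustering step itself never sees them, which is what keeps the approach unsupervised.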

Results: The overall predictive accuracy of the 2-, 4-, and 8-classification clustering models in the training set was 0.856, 0.8634, and 0.8506, respectively. Similarly, the predictive accuracy of the 2-, 4-, and 8-classification clustering models in the testing set was 0.8598, 0.8659, and 0.8525, respectively. After the data were reduced from 19 dimensions to 2 by principal component analysis, significant separation was observed between CVD cases and non-CVD cases in both the training and testing sets.
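The dimensionality reduction mentioned in the Results can be sketched as follows. This is an illustrative principal component analysis in pure Python, using power iteration with deflation to find the two leading components; it is not the authors' implementation, and in practice a library routine would normally be used. The 19 features here are synthetic Gaussian values (the abstract does not list the actual variables), and the names `top_eigenvector` and `pca_2d` are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_eigenvector(matrix, iters=500, seed=0):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in matrix]
    for _ in range(iters):
        w = [dot(row, v) for row in matrix]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])  # Rayleigh quotient
    return v, eigenvalue

def pca_2d(rows):
    """Project rows onto the two leading principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        v, lam = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the component just found so the next pass
        # converges to the following eigenvector.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    projected = [[dot(r, c) for c in components] for r in centered]
    return projected, components

# Synthetic stand-in: 19 features per record, two groups with shifted means.
rng = random.Random(42)
non_cvd = [[rng.gauss(0, 1) for _ in range(19)] for _ in range(40)]
cvd = [[rng.gauss(3, 1) for _ in range(19)] for _ in range(40)]

projected, components = pca_2d(non_cvd + cvd)
pc1_non_cvd = [p[0] for p in projected[:40]]
pc1_cvd = [p[0] for p in projected[40:]]
print(min(pc1_non_cvd), max(pc1_non_cvd))
print(min(pc1_cvd), max(pc1_cvd))
```

Because the sign of a principal component is arbitrary, either group may end up on the positive side of the first axis; what matters for a separation plot is that the two ranges do not overlap.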

Conclusion: Our findings indicate that EMR data can support the development of a robust model for CVD detection through an unsupervised ML approach. Further investigation using a longitudinal design is needed to refine the model for application in clinical settings.

Keywords: Bayesian theorem; Cardiovascular diseases; EMR; K-means clustering; Machine learning.

MeSH terms

  • Adult
  • Aged
  • Bayes Theorem*
  • Cardiovascular Diseases* / diagnosis
  • China / epidemiology
  • Cluster Analysis
  • Electronic Health Records* / statistics & numerical data
  • Female
  • Humans
  • Machine Learning
  • Male
  • Middle Aged
  • Unsupervised Machine Learning*