Unsupervised Machine Learning for the Discovery of Latent Clusters in COVID-19 Patients Using Electronic Health Records

Stud Health Technol Inform. 2020 Jun 26;272:1-4. doi: 10.3233/SHTI200478.


The goal of this paper was to apply unsupervised machine learning techniques to the discovery of latent clusters in COVID-19 patients. Over 6,000 adult patients who tested positive for SARS-CoV-2 infection at the Mount Sinai Health System in New York, USA, met the inclusion criteria for analysis. Each patient diagnosis was mapped to a chronicity category and to one of 18 body systems, and the optimal number of clusters was determined using the K-means algorithm and the elbow method. Four clusters were identified; the most frequently associated comorbidities involved infectious, respiratory, cardiovascular, endocrine, and genitourinary disorders, as well as socioeconomic factors that influence health status and contact with health services. These results offer a strong direction for future research and more granular analysis.
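The abstract does not describe the implementation, so the following is only a minimal sketch of how K-means and the elbow method fit together, not the authors' code. It runs on synthetic data: the six binary features, the four group profiles, and all function names (`make_synthetic_patients`, `kmeans`, `elbow_curve`) are hypothetical and chosen for illustration. It uses plain Lloyd's algorithm with random restarts, which may differ from whatever variant the study used.

```python
import random


def make_synthetic_patients(n_per_group=50, seed=42):
    """Build synthetic binary feature vectors (assumption: NOT study data).

    Each of 4 made-up groups has its own probability of carrying each of
    6 hypothetical diagnosis flags (e.g. 'chronic respiratory').
    """
    rng = random.Random(seed)
    profiles = [
        [0.9, 0.8, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.9, 0.8, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1, 0.9, 0.8],
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    ]
    return [
        [1.0 if rng.random() < p else 0.0 for p in profile]
        for profile in profiles
        for _ in range(n_per_group)
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so centroids are already their means
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: total within-cluster sum of squared distances.
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


def elbow_curve(points, k_values, n_restarts=5):
    """Map each k to the lowest inertia found over several random restarts."""
    return {
        k: min(kmeans(points, k, seed=s)[2] for s in range(n_restarts))
        for k in k_values
    }


if __name__ == "__main__":
    patients = make_synthetic_patients()
    for k, inertia in elbow_curve(patients, range(1, 9)).items():
        print(f"k={k}  inertia={inertia:.1f}")
```

Reading the printed curve, the "elbow" is the value of k after which inertia stops dropping sharply; choosing it is a visual judgment rather than an exact rule, and on real diagnosis data the bend is often less pronounced than on synthetic groups like these.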

Keywords: Big Data Analytics; Unsupervised Machine Learning.

MeSH terms

  • Betacoronavirus*
  • COVID-19
  • Coronavirus Infections*
  • Electronic Health Records*
  • Humans
  • New York
  • Pandemics*
  • Pneumonia, Viral*
  • SARS-CoV-2
  • Unsupervised Machine Learning*