Classification of the Disposition of Patients Hospitalized with COVID-19: Reading Discharge Summaries Using Natural Language Processing

Marta Fernandes; Haoqi Sun; Aayushee Jain; Haitham S Alabsi; Laura N Brenner; Elissa Ye; Wendong Ge; Sarah I Collens; Michael J Leone; Sudeshna Das; Gregory K Robbins; Shibani S Mukerji; M Brandon Westover

doi:10.2196/25457

JMIR Med Inform. 2021 Feb 10;9(2):e25457. doi: 10.2196/25457.

Authors

Marta Fernandes^# ¹ ² ³, Haoqi Sun^# ¹ ² ³, Aayushee Jain ¹ ², Haitham S Alabsi ¹ ³, Laura N Brenner ³ ⁴ ⁵, Elissa Ye ¹ ², Wendong Ge ¹ ² ³, Sarah I Collens ¹, Michael J Leone ¹, Sudeshna Das ¹ ³, Gregory K Robbins^# ³ ⁶, Shibani S Mukerji^# ¹ ³, M Brandon Westover^# ¹ ² ³ ⁷

Affiliations

¹ Department of Neurology, Massachusetts General Hospital, Boston, MA, United States.
² Clinical Data Animation Center, Boston, MA, United States.
³ Harvard Medical School, Boston, MA, United States.
⁴ Division of Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA, United States.
⁵ Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, United States.
⁶ Division of Infectious Diseases, Massachusetts General Hospital, Boston, MA, United States.
⁷ McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, United States.

^# Contributed equally.

PMID: 33449908
PMCID: PMC7879729
DOI: 10.2196/25457

Abstract

Background: Medical notes are a rich source of patient data; however, the nature of unstructured text has largely precluded the use of these data for large retrospective analyses. Transforming clinical text into structured data can enable large-scale research studies with electronic health records (EHR) data. Natural language processing (NLP) can be used for text information retrieval, reducing the need for labor-intensive chart review. Here we present an application of NLP to large-scale analysis of medical records at 2 large hospitals for patients hospitalized with COVID-19.

Objective: Our study goal was to develop an NLP pipeline to classify the discharge disposition (home, inpatient rehabilitation, skilled nursing inpatient facility [SNIF], and death) of patients hospitalized with COVID-19 based on hospital discharge summary notes.

Methods: Text mining and feature engineering were applied to unstructured text from hospital discharge summaries. The study included patients with COVID-19 discharged from 2 hospitals in the Boston, Massachusetts area (Massachusetts General Hospital and Brigham and Women's Hospital) between March 10, 2020, and June 30, 2020. The data were divided into a training set (70%) and hold-out test set (30%). Discharge summaries were represented as bags-of-words consisting of single words (unigrams), bigrams, and trigrams. The number of features was reduced during training by excluding n-grams that occurred in fewer than 10% of discharge summaries, and further reduced using least absolute shrinkage and selection operator (LASSO) regularization while training a multiclass logistic regression model. Model performance was evaluated using the hold-out test set.
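The n-gram representation and the 10% document-frequency cutoff described in the Methods can be illustrated with a minimal sketch. It uses only the Python standard library and assumes simple whitespace tokenization; the function names (`ngrams`, `build_vocabulary`, `vectorize`) and the toy notes are hypothetical and are not taken from the study's code. The LASSO-regularized multiclass logistic regression step is not shown.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Yield every unigram, bigram, and trigram as a space-joined string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(docs, min_df=0.10, n_max=3):
    """Keep n-grams that occur in at least `min_df` of the documents.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each n-gram is counted once per document.
        doc_freq.update(set(ngrams(doc.lower().split(), n_max)))
    threshold = min_df * len(docs)
    return sorted(g for g, count in doc_freq.items() if count >= threshold)


def vectorize(doc, vocabulary, n_max=3):
    """Represent one document as n-gram counts over the vocabulary."""
    counts = Counter(ngrams(doc.lower().split(), n_max))
    return [counts[g] for g in vocabulary]


# Toy notes invented for illustration only (not study data).
toy_notes = [
    "discharged home with home health",
    "discharged to rehab",
    "patient expired",
]
# A 50% cutoff is used here only because the toy corpus is tiny.
toy_vocab = build_vocabulary(toy_notes, min_df=0.5)
toy_vector = vectorize("discharged home discharged", toy_vocab)
```

In a full pipeline, the resulting count vectors would be fit on the training set only and then passed to an L1-penalized classifier, so that the vocabulary never depends on the hold-out test set.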

Results: The study cohort included 1737 adult patients (median age 61 [SD 18] years; 55% men; 45% White and 16% Black; 14% nonsurvivors and 61% discharged home). The model selected 179 features from a vocabulary of 1056 engineered features, consisting of combinations of unigrams, bigrams, and trigrams. The features contributing most to the classification by the model (for each outcome) were the following: "appointments specialty," "home health," and "home care" (home); "intubate" and "ARDS" (inpatient rehabilitation); "service" (SNIF); "brief assessment" and "covid" (death). The model achieved a micro-average area under the receiver operating characteristic curve value of 0.98 (95% CI 0.97-0.98) and average precision of 0.81 (95% CI 0.75-0.84) in the testing set for prediction of discharge disposition.
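For readers unfamiliar with the micro-average metric reported above, the following sketch shows one common way to compute a micro-average AUROC for a multiclass problem: every (patient, class) pair is pooled into a single binary problem before the AUROC is computed. The function names and the example probabilities are hypothetical, and this is not necessarily how the authors computed their estimate or its confidence interval.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties count as half a win. This pairwise form is quadratic in the
    number of samples, which is acceptable for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_average_auroc(y_true, y_prob, n_classes):
    """Pool one-vs-rest indicators and scores across classes, then score.

    y_true: list of integer class indices (e.g. 0 = home, ..., 3 = death).
    y_prob: list of per-class probability lists, one per patient.
    """
    labels, scores = [], []
    for true_class, probs in zip(y_true, y_prob):
        for k in range(n_classes):
            labels.append(1 if true_class == k else 0)
            scores.append(probs[k])
    return auroc(labels, scores)


# Invented two-patient, two-class example (not study data).
example_auc = micro_average_auroc(
    y_true=[0, 1],
    y_prob=[[0.6, 0.4], [0.7, 0.3]],
    n_classes=2,
)
```

Because pooling weights each patient-class pair equally, frequent outcomes such as discharge home influence a micro-average more than rare ones.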

Conclusions: A supervised learning-based NLP approach is able to classify the discharge disposition of patients hospitalized with COVID-19. This approach has the potential to accelerate and increase the scale of research on patients' discharge disposition that is possible with EHR data.

Keywords: BoW; COVID-19; EHR; ICU; LASSO; coronavirus; electronic health record; feature selection; intensive care unit; machine learning; natural language processing; unstructured text.

©Marta Fernandes, Haoqi Sun, Aayushee Jain, Haitham S Alabsi, Laura N Brenner, Elissa Ye, Wendong Ge, Sarah I Collens, Michael J Leone, Sudeshna Das, Gregory K Robbins, Shibani S Mukerji, M Brandon Westover. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 10.02.2021.
