Monitoring COVID-19 pandemic through the lens of social media using natural language processing and machine learning

Health Inf Sci Syst. 2021 Jun 25;9(1):25. doi: 10.1007/s13755-021-00158-4. eCollection 2021 Dec.

Abstract

Purpose: More than a year after the first known case of coronavirus disease (COVID-19) emerged, the pandemic is far from over. To date, it has infected over eighty million people and killed more than 1.78 million worldwide. This study explores two questions: "How useful is the Reddit social media platform for surveilling the COVID-19 pandemic?" and "How do people's concerns and behaviors change over the course of the COVID-19 pandemic in North Carolina?" Its purpose was to compare people's thoughts, behavior changes, discussion topics, and the number of confirmed cases and deaths by applying natural language processing (NLP) to COVID-19 related data.

Methods: We collected COVID-19 related data from 18 North Carolina subreddits from March to August 2020. We then analyzed the collected Reddit posts with methods from natural language processing and machine learning: feature engineering, topic modeling, custom named-entity recognition (NER), and BERT-based (Bidirectional Encoder Representations from Transformers) sentence clustering. These methods allowed us to glean people's responses to, and concerns about, the COVID-19 pandemic in North Carolina.

Results: We observed a positive change in attitudes towards masks among North Carolina residents. Across all subreddit corpora, the high-frequency words in each COVID-19 mitigation strategy category were as follows. Distancing (DIST): "social distance/distancing", "lockdown", and "work from home"; Disinfection (DIT): "(hand) sanitizer/soap", "hygiene", and "wipe"; Personal Protective Equipment (PPE): "mask/facemask(s)/face shield", "n95(s)/kn95", and "cloth/gown"; Symptoms (SYM): "death", "flu/influenza", and "cough/coughed"; Testing (TEST): "cases", "(antibody) test", and "test results (positive/negative)".
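
To make the per-category word counts concrete, the following is a minimal sketch of how posts could be tagged against the five categories with a plain dictionary lookup. It is a much simpler stand-in for the custom NER model the study describes, not the authors' implementation. The lexicon reuses a handful of the high-frequency terms reported above; the names `LEXICON`, `PATTERNS`, and `count_categories`, and the choice of plural and verb suffixes, are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon built from a few of the reported high-frequency terms.
LEXICON = {
    "DIST": ["social distancing", "social distance", "lockdown", "work from home"],
    "DIT": ["hand sanitizer", "sanitizer", "soap", "hygiene", "wipe"],
    "PPE": ["face shield", "facemask", "mask", "n95", "kn95", "gown"],
    "SYM": ["death", "flu", "influenza", "cough"],
    "TEST": ["antibody test", "test results", "cases", "test"],
}


def _compile(terms):
    # Try longer terms first so "hand sanitizer" wins over "sanitizer".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Allow a simple inflectional suffix (masks, coughed, coughing).
    return re.compile(r"\b(" + alternatives + r")(?:s|es|ed|ing)?\b", re.IGNORECASE)


PATTERNS = {category: _compile(terms) for category, terms in LEXICON.items()}


def count_categories(posts):
    """Return {category: Counter of matched lexicon terms} over all posts."""
    counts = {category: Counter() for category in PATTERNS}
    for post in posts:
        for category, pattern in PATTERNS.items():
            for match in pattern.finditer(post):
                # group(1) is the lexicon term without any suffix.
                counts[category][match.group(1).lower()] += 1
    return counts
```

Calling `count_categories` on a list of post strings yields one `Counter` per category, whose `most_common()` method gives a ranked term list of the kind summarized above. A lookup like this cannot handle misspellings or context (for example, "cases" in a non-medical sense), which is one reason a trained NER model may be preferred.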

Conclusion: Our findings show that Reddit data can be used effectively to monitor the COVID-19 pandemic in North Carolina (NC). The study demonstrates the utility of NLP methods (e.g., cosine similarity, Latent Dirichlet Allocation (LDA) topic modeling, custom NER, and BERT-based sentence clustering) for discovering changes in the public's concerns and behaviors over the course of the pandemic in NC using Reddit data. Moreover, the results show that social media data can be used to surveil the epidemic situation in a specific community.
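
Of the methods named above, cosine similarity is the simplest to illustrate. The abstract does not say which text representation the similarity was computed over, so the sketch below assumes raw term-count vectors; the function names `bag_of_words` and `cosine_similarity` are hypothetical. For two vectors a and b, the score is their dot product divided by the product of their Euclidean norms.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors stored as Counters."""
    shared_terms = a.keys() & b.keys()
    dot = sum(a[term] * b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

With non-negative counts the score ranges from 0 (no shared terms) to 1 (identical term proportions), so it could be used, for instance, to compare the vocabulary of posts from two different months.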

Keywords: COVID-19; Named-entity recognition; Natural language processing; Sentence clustering; Social media; Topic modeling.