SEED: Symptom Extraction from English Social Media Posts using Deep Learning and Transfer Learning

Arjun Magge; Davy Weissenbacher; Karen O'Connor; Matthew Scotch; Graciela Gonzalez-Hernandez

doi:10.1101/2021.02.09.21251454

medRxiv [Preprint]. 2022 Mar 21:2021.02.09.21251454. doi: 10.1101/2021.02.09.21251454.

Authors

Arjun Magge¹, Davy Weissenbacher¹, Karen O'Connor¹, Matthew Scotch², Graciela Gonzalez-Hernandez¹

Affiliations

¹ Perelman School of Medicine, University of Pennsylvania.
² College of Health Solutions, Arizona State University.

Abstract

The increase in social media usage across the globe has fueled efforts in digital epidemiology to mine valuable information such as medication use, adverse drug effects, and reports of viral infections that directly and indirectly affect population health. Such specific information can, however, be scarce, hard to find, and mostly expressed in very colloquial language. In this work, we focus on a fundamental problem that enables social media mining for disease monitoring. We present and make available SEED, a natural language processing approach to detect symptom and disease mentions in social media data obtained from platforms such as Twitter and DailyStrength and to normalize them to UMLS terminology. Using multi-corpus training and deep learning models, the tool achieves overall F1 scores of 0.86 and 0.72 on the DailyStrength and balanced Twitter datasets, respectively, significantly improving over previous approaches on the same datasets. We apply the tool to Twitter posts that report COVID-19 symptoms, particularly to quantify whether the SEED system can extract symptoms absent from the training data. The study results also draw attention to the potential of multi-corpus training for performance improvements and the need for continuous training on newly obtained data for consistent performance amidst the ever-changing nature of the social media vocabulary.
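
For readers unfamiliar with the F1 metric reported above, the following is a minimal sketch of how an F1 score could be computed for extracted symptom spans. It assumes strict matching (a predicted span counts only if its offsets and label exactly match a gold annotation); the abstract does not state which matching criterion was used, and the `Span` representation, the `span_f1` function, and the example offsets are hypothetical illustrations, not part of SEED.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict span matching (an assumption)."""
    # A prediction is a true positive only if it is identical to a gold span.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for a single post: three gold spans, three predictions,
# two of which match a gold span exactly.
gold = {(7, 15, "SYMPTOM"), (24, 29, "SYMPTOM"), (40, 52, "SYMPTOM")}
predicted = {(7, 15, "SYMPTOM"), (24, 29, "SYMPTOM"), (60, 66, "SYMPTOM")}

precision, recall, f1 = span_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With two of three predictions correct and two of three gold spans found, precision and recall are both 2/3, so the F1 score is also 2/3.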

Keywords: Deep Learning; Information Extraction; Natural Language Processing; Pharmacovigilance; Social Media Mining.

Publication types

Preprint

Grants and funding

R01 LM011176/LM/NLM NIH HHS/United States