A Scalable Framework to Detect Personal Health Mentions on Twitter
- PMID: 26048075
- PMCID: PMC4526910
- DOI: 10.2196/jmir.4305
Abstract
Background: Biomedical research has traditionally been conducted via surveys and the analysis of medical records. However, these resources are limited in their content, such that non-traditional domains (eg, online forums and social media) have an opportunity to supplement the view of an individual's health.
Objective: The objective of this study was to develop a scalable framework to detect personal health status mentions on Twitter and assess the extent to which such information is disclosed.
Methods: We collected more than 250 million tweets via the Twitter streaming API over a 2-month period in 2014. The corpus was filtered down to approximately 250,000 tweets, stratified across 34 high-impact health issues, based on guidance from the Medical Expenditure Panel Survey. We created a labeled corpus of several thousand tweets via a survey, administered over Amazon Mechanical Turk, that documents when terms correspond to mentions of personal health issues or an alternative (eg, a metaphor). We engineered a scalable classifier for personal health mentions via feature selection and assessed its potential over the health issues. We further investigated the utility of the tweets by determining the extent to which Twitter users disclose personal health status.
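The filtering step in the Methods can be illustrated with a minimal sketch. The abstract does not say how the corpus was filtered or stratified, so the keyword matching, the `HEALTH_TERMS` lists, and the per-issue `cap` below are all assumptions for illustration, not the authors' procedure.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists: the study covered 34 health issues, but its
# actual search terms are not given in the abstract.
HEALTH_TERMS = {
    "migraine": ["migraine", "migraines"],
    "insomnia": ["insomnia"],
    "cancer": ["cancer"],
}


def compile_patterns(terms):
    """Build one case-insensitive, whole-word regex per health issue."""
    return {
        issue: re.compile(
            r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
            re.IGNORECASE,
        )
        for issue, words in terms.items()
    }


def stratify(tweets, patterns, cap):
    """Group tweets by the health issue they mention, keeping at most `cap` per issue."""
    buckets = defaultdict(list)
    for text in tweets:
        for issue, pattern in patterns.items():
            if pattern.search(text) and len(buckets[issue]) < cap:
                buckets[issue].append(text)
    return dict(buckets)


sample = [
    "My migraine is killing me",
    "Traffic is a cancer on this city",  # metaphor, not a personal health mention
    "Nice weather today",
    "Insomnia again...",
    "migraines run in my family",
    "Another MIGRAINE day",
]
filtered = stratify(sample, compile_patterns(HEALTH_TERMS), cap=2)
```

Keyword matching alone keeps the metaphorical "cancer" tweet, which is the kind of case the labeled corpus and the classifier are meant to separate from genuine personal health mentions.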
Results: Our investigation yielded several notable findings. First, we found that tweets from a small subset of the health issues can train a scalable classifier to detect health mentions. Specifically, training on 2000 tweets from four health issues (cancer, depression, hypertension, and leukemia) yielded a classifier with precision of 0.77 on all 34 health issues. Second, Twitter users disclosed personal health status for all health issues. Notably, personal health status was disclosed over 50% of the time for 11 out of 34 (32%) investigated health issues. Third, the disclosure rate was dependent on the health issue in a statistically significant manner (P<.001). For instance, more than 80% of the tweets about migraines (83/100) and allergies (85/100) communicated personal health status, while only around 10% of the tweets about obesity (13/100) and heart attack (12/100) did so. Fourth, the likelihood that people disclose their own versus other people's health status was dependent on health issue in a statistically significant manner as well (P<.001). For example, 69% (69/100) of the insomnia tweets disclosed the author's status, while only 1% (1/100) disclosed another person's status. By contrast, 1% (1/100) of the Down syndrome tweets disclosed the author's status, while 21% (21/100) disclosed another person's status.
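The dependence of disclosure rate on health issue can be checked on a small scale with the four counts quoted in the Results. The abstract does not name the statistical test used, and the reported P value covers all 34 health issues, so the sketch below assumes a Pearson chi-square test of independence and applies it only to these four issues as an illustration.

```python
# Tweets (out of 100 sampled per issue) that disclosed personal health status,
# as quoted in the Results.
disclosed = {"migraine": 83, "allergies": 85, "obesity": 13, "heart attack": 12}
SAMPLE_SIZE = 100


def chi_square_statistic(disclosed_counts, n_per_group):
    """Pearson chi-square statistic for an (issues x disclosed/not disclosed) table."""
    rows = [(k, n_per_group - k) for k in disclosed_counts.values()]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of the chi-square distribution with
# (4 - 1) * (2 - 1) = 3 degrees of freedom at alpha = 0.001.
CRITICAL_VALUE_DF3_P001 = 16.266

stat = chi_square_statistic(disclosed, SAMPLE_SIZE)
print(f"chi-square = {stat:.2f}, exceeds critical value: {stat > CRITICAL_VALUE_DF3_P001}")
```

For these four issues alone the statistic is far above the critical value, which is consistent with the reported P<.001, although it does not reproduce the study's analysis.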
Conclusions: It is possible to automatically detect personal health status mentions on Twitter in a scalable manner. These mentions correspond to the health issues of the Twitter users themselves, but also other individuals. Though this study did not investigate the veracity of such statements, we anticipate such information may be useful in supplementing traditional health-related sources for research purposes.
Keywords: consumer health; infodemiology; information retrieval; machine learning; social media; twitter.
Conflict of interest statement
Conflicts of Interest: None declared.
