Data and Model Biases in Social Media Analyses: A Case Study of COVID-19 Tweets

AMIA Annu Symp Proc. 2022 Feb 21;2021:1264-1273. eCollection 2021.

Abstract

During the coronavirus disease 2019 (COVID-19) pandemic, social media platforms such as Twitter have become a venue for individuals, health professionals, and government agencies to share COVID-19 information. Twitter has been a popular source of data for researchers, especially for public health studies. However, the use of Twitter data for research also has drawbacks and barriers. Biases can arise at every stage, from data collection methods to modeling approaches, and those biases have not been systematically assessed. In this study, we examined six different data collection methods and three different machine learning (ML) models (commonly used in social media analysis) to assess data collection bias and to measure the ML models' sensitivity to that bias. We showed that (1) publicly available Twitter data collection endpoints, used with appropriate strategies, can collect data that is reasonably representative of the Twitter universe; and (2) careful examination of ML models' sensitivity to data collection bias is critical.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Bias
  • COVID-19* / epidemiology
  • Data Collection / methods
  • Humans
  • Machine Learning
  • Social Media*