Twitter: a good place to detect health conditions

PLoS One. 2014 Jan 29;9(1):e86191. doi: 10.1371/journal.pone.0086191. eCollection 2014.


With the proliferation of social networks and blogs, the Internet is increasingly used to disseminate personal health information rather than only as a source of information. In this paper we exploit the wealth of user-generated data available through the micro-blogging service Twitter to estimate and track the incidence of health conditions in society. The method has two stages: we first extract possibly relevant tweets using a set of specially crafted regular expressions, and then classify these initial messages using machine learning methods. We also select relevant features to improve the results and the execution times. To test the method, we considered four health states or conditions (flu, depression, pregnancy and eating disorders) and two locations (Portugal and Spain). We present the results obtained and show that both the detection results and the performance of the method improve after feature selection. The results are promising, with areas under the receiver operating characteristic curve between 0.7 and 0.9 and F-measure values of around 0.8 to 0.9. This indicates that such an approach provides a feasible solution for measuring and tracking the evolution of health states within society.
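The two-stage idea described in the abstract (a regular-expression filter followed by a machine learning classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the pattern `FLU_PATTERN`, the toy training tweets, the `NaiveBayes` class and the `detect` helper are all hypothetical, naive Bayes is simply one convenient classifier choice, and the feature selection step is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stage-1 pattern for flu; the paper's actual expressions are not shown here.
FLU_PATTERN = re.compile(r"\b(flu|gripe|influenza)\b", re.IGNORECASE)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-záéíóúñçãõ]+", text.lower())


def stage1_filter(tweets, pattern=FLU_PATTERN):
    """Stage 1: keep only tweets that match the keyword pattern."""
    return [t for t in tweets if pattern.search(t)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative choice)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            n_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total)
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[c][tok] + 1
                    score += math.log(count / (n_words + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)


# Toy labelled data (invented): 1 = the author reports having flu, 0 = other mention.
TRAIN = [
    ("I have the flu and a fever, staying in bed", 1),
    ("estoy con gripe y fiebre", 1),
    ("caught the flu, feeling awful", 1),
    ("flu shot campaign starts next week", 0),
    ("news: flu season statistics published", 0),
    ("la gripe aviar en las noticias", 0),
]
model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])


def detect(tweets):
    """Stage 2: classify the stage-1 candidates and keep the positive ones."""
    return [t for t in stage1_filter(tweets) if model.predict(t) == 1]


if __name__ == "__main__":
    print(detect(["I have flu and fever in bed",
                  "flu season news statistics",
                  "sunny day"]))
```

The filter keeps the candidate set small before the costlier classification step, which is the motivation the abstract gives for the two stages; a real system would need far more training data and an evaluation with measures such as ROC area and F-measure.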

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence / statistics & numerical data*
  • Blogging / statistics & numerical data*
  • Depression / epidemiology
  • Feeding and Eating Disorders / epidemiology
  • Female
  • Health Knowledge, Attitudes, Practice*
  • Humans
  • Influenza, Human / epidemiology
  • Portugal / epidemiology
  • Pregnancy
  • ROC Curve
  • Spain / epidemiology

Grant support

The work of VMP, MA and FC was supported by Xunta de Galicia CN2012/211, the Ministry of Education and Science of Spain and FEDER funds of the European Union (Project TIN2009-14203). SM and JLO were funded by FEDER through the COMPETE programme and by Portuguese national funds through FCT - “Fundação Para a Ciência e a Tecnologia” under project number PTDC/EIA-CCO/100541/2008 (FCOMP-01-0124-FEDER-010029), and by the QREN Mais Centro program through the Cloud Thinking project (CENTRO-07-ST24-FEDER-002031). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.