Cadec: A corpus of adverse drug event annotations

J Biomed Inform. 2015 Jun;55:73-81. doi: 10.1016/j.jbi.2015.03.010. Epub 2015 Mar 27.


The CSIRO Adverse Drug Event Corpus (Cadec) is a new, richly annotated corpus of medical forum posts on patient-reported Adverse Drug Events (ADEs). The corpus is sourced from posts on social media and contains text that is largely written in colloquial language and often deviates from formal English grammar and punctuation rules. Annotations contain mentions of concepts such as drugs, adverse effects, symptoms, and diseases, each linked to its corresponding concept in controlled vocabularies, namely SNOMED Clinical Terms and MedDRA. The quality of the annotations is ensured by annotation guidelines, multi-stage annotation, measurement of inter-annotator agreement, and a final review of the annotations by a clinical terminologist. This corpus is useful for studies in information extraction, or more generally text mining, from social media to detect possible adverse drug reactions from direct patient reports. The corpus is publicly available at

Keywords: Adverse drug reaction; Annotated corpus; Consumer reviews; Drug safety; Information extraction; MedDRA; Medical forum; SNOMED CT; Social media.

MeSH terms

  • Adverse Drug Reaction Reporting Systems / organization & administration*
  • Consumer Health Information / organization & administration*
  • Data Mining / methods*
  • Datasets as Topic / statistics & numerical data
  • Drug-Related Side Effects and Adverse Reactions / classification*
  • Drug-Related Side Effects and Adverse Reactions / epidemiology
  • Guidelines as Topic
  • Humans
  • Machine Learning
  • Natural Language Processing
  • Social Media / classification
  • Social Media / organization & administration*
  • Terminology as Topic
  • Vocabulary, Controlled*