Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

J Am Med Inform Assoc. 2017 Nov 1;24(6):1165-1168. doi: 10.1093/jamia/ocx053.


Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but separating RCTs from non-RCTs requires extensive manual effort, even with current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML.

Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise.
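The routing rule described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`triage`, `toy_score`), the keyword-based scorer, and the threshold value of 0.1 are all hypothetical stand-ins. In practice the scorer would be a trained classifier returning an estimated probability that a citation describes an RCT, and the threshold would be tuned on held-out data to keep recall high.

```python
from typing import Callable, List, Sequence, Tuple


def toy_score(citation: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Returns a made-up 'probability of being an RCT' based on a keyword
    match. A real system would use a model fitted to labeled citations.
    """
    text = citation.lower()
    if "randomized" in text or "randomised" in text:
        return 0.9
    return 0.05


def triage(
    citations: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.1,  # assumed value, for illustration only
) -> Tuple[List[str], List[str]]:
    """Split citations into (auto_excluded, sent_to_crowd).

    Citations scored below the threshold are excluded automatically;
    everything else is deferred to crowdworkers for a decision.
    """
    auto_excluded: List[str] = []
    sent_to_crowd: List[str] = []
    for citation in citations:
        if score(citation) < threshold:
            auto_excluded.append(citation)
        else:
            sent_to_crowd.append(citation)
    return auto_excluded, sent_to_crowd


if __name__ == "__main__":
    sample = [
        "A randomized controlled trial of drug X versus placebo",
        "Case report: an unusual presentation of condition Y",
    ]
    excluded, crowd = triage(sample, toy_score)
    print("auto-excluded:", excluded)
    print("sent to crowd:", crowd)
```

The classifier is used only to rule citations out, never to rule them in, so its errors on the retained set can still be corrected by human screeners.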

Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone.
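The two quantities reported in the Results, recall and reduction in manual effort, can be computed for a given exclusion threshold roughly as follows. This is a sketch under stated assumptions, not the paper's evaluation procedure: the function name `hybrid_metrics` is hypothetical, the scores and labels in the example are invented, and the recall figure is an upper bound that assumes the crowd correctly labels every citation passed to it.

```python
from typing import Sequence, Tuple


def hybrid_metrics(
    scores: Sequence[float],
    is_rct: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (recall, workload_reduction) for an exclusion threshold.

    recall: fraction of true RCTs that are NOT auto-excluded, i.e. that
        still reach human screeners (assumes the crowd then labels them
        correctly).
    workload_reduction: fraction of all citations that are auto-excluded
        and therefore never need manual screening.
    """
    if len(scores) != len(is_rct):
        raise ValueError("scores and is_rct must have the same length")
    if not scores:
        raise ValueError("at least one citation is required")

    excluded = [s < threshold for s in scores]
    n_rct = sum(1 for y in is_rct if y)
    missed = sum(1 for e, y in zip(excluded, is_rct) if e and y)

    recall = (n_rct - missed) / n_rct if n_rct else float("nan")
    workload_reduction = sum(excluded) / len(scores)
    return recall, workload_reduction


if __name__ == "__main__":
    # Invented toy data: ten citations, four of which are true RCTs.
    scores = [0.9, 0.8, 0.05, 0.02, 0.6, 0.01, 0.03, 0.7, 0.04, 0.08]
    labels = [True, True, False, False, False, False, True, True, False, False]
    recall, reduction = hybrid_metrics(scores, labels, threshold=0.1)
    print(f"recall={recall:.2f}, workload reduction={reduction:.2f}")
```

Raising the threshold excludes more citations and so increases the workload reduction, at the cost of a higher risk of discarding true RCTs; the reported ranges reflect that trade-off.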

Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

Keywords: crowdsourcing; evidence-based medicine; human computation; machine learning; natural language processing.

MeSH terms

  • Biomedical Research
  • Crowdsourcing*
  • Databases, Bibliographic
  • Information Storage and Retrieval / methods*
  • Machine Learning*
  • Natural Language Processing
  • ROC Curve
  • Randomized Controlled Trials as Topic*
  • Review Literature as Topic
  • Support Vector Machine