Automated extraction of reported statistical analyses: towards a logical representation of clinical trial literature

AMIA Annu Symp Proc. 2012;2012:350-9. Epub 2012 Nov 3.

Abstract

Randomized controlled trials are an important source of evidence for guiding clinical decisions when treating a patient. However, given the large number of studies and their variability in quality, determining how to summarize reported results and formalize them as part of practice guidelines continues to be a challenge. We have developed a set of information extraction and annotation tools to automate the identification of key information in papers: the hypothesis, sample size, statistical test, confidence interval, significance level, and conclusions. We adapted the Automated Sequence Annotation Pipeline to map extracted phrases to relevant knowledge sources. We trained and tested our system on a corpus of 42 full-text articles related to chemotherapy of non-small cell lung cancer. On our test set of 7 papers, we obtained an overall precision of 86%, recall of 78%, and an F-score of 0.82 for classifying sentences. This work is a step towards using this information for quality assessment, meta-analysis, and modeling.
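The reported sentence-classification figures can be checked against one another. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not say which F-measure was used, and the function name `f_score` is illustrative rather than taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is the balanced F1 measure.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values reported in the abstract for the 7-paper test set.
reported_precision = 0.86
reported_recall = 0.78

f1 = f_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")
```

Under this assumption the recomputed value rounds to 0.82, which agrees with the figure given in the abstract.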

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Carcinoma, Non-Small-Cell Lung
  • Electronic Data Processing*
  • Evidence-Based Medicine
  • Humans
  • Information Storage and Retrieval / methods*
  • Lung Neoplasms
  • Natural Language Processing
  • Randomized Controlled Trials as Topic* / statistics & numerical data
  • Sensitivity and Specificity