Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr

Environ Int. 2022 Jan 15;159:107025. doi: 10.1016/j.envint.2021.107025. Epub 2021 Dec 14.


Introduction: There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields.

Objectives: To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction.

Methods: Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr's performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies.

Results: The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow.

Discussion: Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports. The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields.

Keywords: Automation; Literature review; Machine learning; Natural language processing; Scoping review; Systematic evidence map; Systematic review; Text mining.

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Animals
  • Machine Learning*
  • Public Health*
  • Review Literature as Topic
  • Software