Machine learning for screening prioritization in systematic reviews: comparative performance of Abstrackr and EPPI-Reviewer

Amy Y Tsou et al. Syst Rev. 2020 Apr 2;9(1):73. doi: 10.1186/s13643-020-01324-7.

Abstract

Background: Improving the speed of systematic review (SR) development is key to supporting evidence-based medicine. Machine learning tools that semi-automate citation screening might improve efficiency. Few studies have assessed the use of screening prioritization functionality or compared two tools head to head. In this project, we compared the performance of two machine learning tools for potential use in citation screening.

Methods: Using 9 evidence reports previously completed by the ECRI Institute Evidence-based Practice Center team, we compared the performance of Abstrackr and EPPI-Reviewer, two off-the-shelf citation screening tools, for identifying relevant citations. Screening prioritization functionality was tested on 3 large reports and 6 small reports spanning a range of clinical topics. The large report topics were imaging for pancreatic cancer, indoor allergen reduction, and inguinal hernia repair. We trained Abstrackr and EPPI-Reviewer and screened all citations in 10% increments. In Task 1, we input whether an abstract was ordered for full-text screening; in Task 2, we input whether an abstract was included in the final report. For both tasks, screening continued until all studies ordered and included in the actual reports were identified. We assessed the potential reduction in hypothetical screening burden (the proportion of citations screened to identify all included studies) offered by each tool for all 9 reports.
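The screening-burden metric described above can be sketched in a few lines: given a tool's final ranked list and the set of studies ultimately included, the burden is the fraction of the list that must be screened before the last included study surfaces, and sensitivity at a cutoff is the share of included studies found in the top portion of the list. The function and variable names below are illustrative, not from the study's own code.

```python
def screening_burden(ranked_ids, included_ids):
    """Fraction of the ranked list that must be screened to
    capture every included study (i.e., reach 100% sensitivity)."""
    included = set(included_ids)
    # position of the last included study in the prioritized order
    last = max(i for i, cid in enumerate(ranked_ids) if cid in included)
    return (last + 1) / len(ranked_ids)

def sensitivity_at(ranked_ids, included_ids, fraction):
    """Proportion of included studies found after screening the
    top `fraction` of the ranked list (e.g., one 10% increment)."""
    included = set(included_ids)
    cutoff = round(len(ranked_ids) * fraction)
    found = sum(1 for cid in ranked_ids[:cutoff] if cid in included)
    return found / len(included)

# Toy example: 10 citations, 2 of which ("b", "d") were included.
ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
print(screening_burden(ranked, {"b", "d"}))     # → 0.4
print(sensitivity_at(ranked, {"b", "d"}, 0.3))  # → 0.5
```

The "potential reduction in screening burden" reported in the Results then corresponds to 1 minus this fraction.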

Results: For the 3 large reports, both EPPI-Reviewer and Abstrackr performed well with potential reductions in screening burden of 4 to 49% (Abstrackr) and 9 to 60% (EPPI-Reviewer). Both tools had markedly poorer performance for 1 large report (inguinal hernia), possibly due to its heterogeneous key questions. Based on McNemar's test for paired proportions in the 3 large reports, EPPI-Reviewer outperformed Abstrackr for identifying articles ordered for full-text review, but Abstrackr performed better in 2 of 3 reports for identifying articles included in the final report. For small reports, both tools provided benefits but EPPI-Reviewer generally outperformed Abstrackr in both tasks, although these results were often not statistically significant.
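McNemar's test for paired proportions, used above to compare the tools on the same citation sets, reduces to the discordant pairs: citations that one tool's ordering surfaced within the screened portion but the other's did not. A minimal stdlib sketch of the exact version follows; `b` and `c` are illustrative counts, and this is a generic exact test, not the authors' analysis code.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test on discordant pairs.
    b: citations screened only under tool A's ordering;
    c: citations screened only under tool B's ordering.
    Returns (odds_ratio, two-sided exact p-value)."""
    n = b + c
    k = min(b, c)
    # two-sided exact p: double the smaller tail of Binomial(n, 0.5)
    p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    # OR is infinite when one tool has no unique screening burden,
    # as happened for dabigatran in Task 2 (see Fig. 5)
    odds_ratio = b / c if c else float("inf")
    return odds_ratio, p

# Hypothetical counts: 40 discordant citations favoring B, 15 favoring A
or_, p = mcnemar_exact(b=40, c=15)
print(round(or_, 2), round(p, 4))
```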

Conclusions: Abstrackr and EPPI-Reviewer performed well, but prioritization accuracy varied greatly across reports. Our work suggests that screening prioritization functionality is a promising modality, offering efficiency gains without giving up human involvement in the screening process.

Keywords: Abstrackr; Citation screening; EPPI-Reviewer; Efficiency; Machine learning; Methodology; Screening burden; Screening prioritization; Text-mining.


Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Screening prioritization and potential reduced screening burden. This figure demonstrates graphically how screening prioritization works. Prior to screening, the articles ultimately included are randomly dispersed (top half). Reviewers train the algorithm by manually including/excluding studies until a pre-specified number of studies are included. The algorithm then generalizes rules and prioritizes the remaining studies for evaluation by the reviewer
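The train-then-prioritize loop described in this caption can be illustrated with a toy scorer: weight terms by how often they appear in included versus excluded abstracts, then re-rank the unscreened citations by that weight. Real tools such as Abstrackr use proper text classifiers, so the keyword-difference scoring, function names, and sample abstracts below are purely hypothetical stand-ins.

```python
from collections import Counter

def train_scores(screened):
    """Build crude term weights from (abstract_text, included) pairs
    judged so far by the human reviewer."""
    pos, neg = Counter(), Counter()
    for text, included in screened:
        (pos if included else neg).update(text.lower().split())
    # terms common in included abstracts get positive weight
    return {t: pos[t] - neg.get(t, 0) for t in pos}

def prioritize(unscreened, weights):
    """Order unscreened citations from most to least relevant."""
    def score(text):
        return sum(weights.get(t, 0) for t in text.lower().split())
    return sorted(unscreened, key=score, reverse=True)

# Reviewer judgments from the training phase (hypothetical)
screened = [("laparoscopic hernia repair trial", True),
            ("hernia repair outcomes", True),
            ("asthma allergen exposure", False)]
weights = train_scores(screened)
ranked = prioritize(["gene expression in mice",
                     "open hernia repair cohort"], weights)
print(ranked[0])  # → "open hernia repair cohort"
```

In practice the loop repeats: each new human judgment is fed back into training, and the remaining citations are re-prioritized.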
Fig. 2
Citation screening process. This figure shows our process for analyzing ordered lists in 10% increments. During training, each tool received sufficient input to generate a prediction algorithm and prioritize all remaining unscreened citations by ordering them from most to least relevant. After this point, we exported ordered lists of unscreened remaining abstracts at each 10% increment (for each evidence report) until all relevant articles had been presented for screening by each tool
Fig. 3
Prioritization accuracy for Task 1 (inclusion for full-text screening). This figure shows the percentage of articles screened by each tool to reach 100% of all articles ordered for full-text screening. Smaller bars indicate higher prioritization accuracy (as fewer articles had to be screened in order to achieve 100% sensitivity). The comparative performance of the two tools for each evidence report is displayed in gray (Abstrackr) and white (EPPI-Reviewer)
Fig. 4
Prioritization accuracy for Task 2 (final inclusion). This figure shows the percentage of articles screened by each tool to reach 100% of all articles included in the final report. Smaller bars indicate higher prioritization accuracy (as fewer articles had to be screened in order to achieve 100% sensitivity). The comparative performance of the two tools for each evidence report is displayed in gray (Abstrackr) and white (EPPI-Reviewer)
Fig. 5
Statistical comparisons of screening burden. Each point is a McNemar’s odds ratio comparing EPPI-Reviewer and Abstrackr. Points to the left of center favor EPPI-Reviewer, and points to the right of center favor Abstrackr. Horizontal bars show 95% confidence intervals. Note that for dabigatran, in Task 2 (final inclusion), Abstrackr required all abstracts to be screened before reaching the last included study, while EPPI-Reviewer required only 70% of abstracts to be screened to identify all included studies. This resulted in an infinite McNemar’s odds ratio, so an apparent ln(OR) of −3 was plotted to represent the finding on the graph; the result was statistically significant in favor of EPPI-Reviewer
Fig. 6
Sensitivity at various thresholds for pancreatic cancer imaging (Task 1). This figure shows the proportion of included articles that were screened at each 10% increment. Performance for both Abstrackr (dotted lines with triangles) and EPPI-Reviewer (solid lines with circles) is plotted. Chance performance is shown by the 45 degree line. As the y-axis plots sensitivity, curves closer to the top left of the graph indicate faster learning
Fig. 7
Sensitivity at various thresholds for pancreatic cancer imaging (Task 2). This figure shows the proportion of included articles that were screened at each 10% increment. Performance for both Abstrackr (dotted lines with triangles) and EPPI-Reviewer (solid lines with circles) is plotted. Chance performance is shown by the 45 degree line. As the y-axis plots sensitivity, curves closer to the top left of the graph indicate faster learning
Fig. 8
Sensitivity at various thresholds for indoor allergen reduction (Task 1). This figure shows the proportion of included articles that were screened at each 10% increment. Performance for both Abstrackr (dotted lines with triangles) and EPPI-Reviewer (solid lines with circles) is plotted. Chance performance is shown by the 45 degree line. As the y-axis plots sensitivity, curves closer to the top left of the graph indicate faster learning
Fig. 9
Sensitivity at various thresholds for indoor allergen reduction (Task 2). This figure shows the proportion of included articles that were screened at each 10% increment. Performance for both Abstrackr (dotted lines with triangles) and EPPI-Reviewer (solid lines with circles) is plotted. Chance performance is shown by the 45 degree line. As the y-axis plots sensitivity, curves closer to the top left of the graph indicate faster learning
Fig. 10
Sensitivity at various thresholds, surgical interventions for inguinal hernia (Task 1). This figure shows the proportion of included articles that were screened at each 10% increment. Performance for both Abstrackr (dotted lines with triangles) and EPPI-Reviewer (solid lines with circles) is plotted. Chance performance is shown by the 45 degree line. As the y-axis plots sensitivity, curves closer to the top left of the graph indicate faster learning
Fig. 11
Sensitivity at various thresholds, surgical interventions for inguinal hernia (Task 2). This figure shows the proportion of included articles that were screened at each 10% increment. Performance for both Abstrackr (dotted lines with triangles) and EPPI-Reviewer (solid lines with circles) is plotted. Chance performance is shown by the 45 degree line. As the y-axis plots sensitivity, curves closer to the top left of the graph indicate faster learning

