Validation pipeline for machine learning algorithm assessment for multiple vendors

Bernardo C Bizzo; Shadi Ebrahimian; Mark E Walters; Mark H Michalski; Katherine P Andriole; Keith J Dreyer; Mannudeep K Kalra; Tarik Alkasab; Subba R Digumarthy

doi:10.1371/journal.pone.0267213

PLoS One. 2022 Apr 29;17(4):e0267213. doi: 10.1371/journal.pone.0267213. eCollection 2022.

Authors

Bernardo C Bizzo¹,², Shadi Ebrahimian², Mark E Walters¹, Mark H Michalski¹,², Katherine P Andriole¹,³, Keith J Dreyer¹,², Mannudeep K Kalra¹,², Tarik Alkasab¹,², Subba R Digumarthy²

Affiliations

¹ MGH & BWH Center for Clinical Data Science, Mass General Brigham, Boston, Massachusetts, United States of America.
² Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, United States of America.
³ Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, United States of America.

Abstract

A standardized, objective evaluation method is needed to compare machine learning (ML) algorithms as these tools become available for clinical use. Therefore, we designed, built, and tested an evaluation pipeline with the goal of normalizing performance measurement of independently developed algorithms, using a common test dataset of our clinical imaging. Three vendor applications for detecting solid, part-solid, and groundglass lung nodules in chest CT examinations were assessed in this retrospective study using our data-preprocessing and algorithm assessment chain. The pipeline included tools for image cohort creation and de-identification; report and image annotation for ground-truth labeling; server partitioning to receive vendor "black box" algorithms and to enable model testing on our internal clinical data (100 chest CTs with 243 nodules) from within our security firewall; model validation and result visualization; and performance assessment calculating algorithm recall, precision, and receiver operating characteristic (ROC) curves. Algorithm true positives, false positives, false negatives, recall, and precision for detecting lung nodules were as follows: Vendor-1 (194, 23, 49, 0.80, 0.89); Vendor-2 (182, 270, 61, 0.75, 0.40); Vendor-3 (75, 120, 168, 0.32, 0.39). The AUCs for detection of solid (0.61-0.74), groundglass (0.66-0.86), and part-solid (0.52-0.86) nodules varied among the three vendors. Our ML model validation pipeline enabled testing of multi-vendor algorithms within the institutional firewall. Wide variations in algorithm performance for detection as well as classification of lung nodules justify a standardized, objective ML algorithm evaluation process.
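The recall and precision figures above follow from the true-positive, false-positive, and false-negative counts by the standard definitions, recall = TP / (TP + FN) and precision = TP / (TP + FP). The following is a minimal sketch of that arithmetic, not the pipeline's own code; the class name `DetectionCounts` and the dictionary `reported` are hypothetical names introduced here for illustration, and only the counts are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Nodule-level detection counts for one vendor (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives

    @property
    def recall(self) -> float:
        # Fraction of ground-truth nodules that the algorithm found.
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def precision(self) -> float:
        # Fraction of algorithm detections that were true nodules.
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0


# Counts as listed in the abstract: (TP, FP, FN) per vendor.
reported = {
    "Vendor-1": DetectionCounts(tp=194, fp=23, fn=49),
    "Vendor-2": DetectionCounts(tp=182, fp=270, fn=61),
    "Vendor-3": DetectionCounts(tp=75, fp=120, fn=168),
}

for name, counts in reported.items():
    total = counts.tp + counts.fn  # should equal the 243 annotated nodules
    print(f"{name}: nodules={total} "
          f"recall={counts.recall:.2f} precision={counts.precision:.2f}")
```

For each vendor, TP + FN equals the 243 annotated nodules. Recomputing from the counts reproduces the reported values for Vendor-1 and Vendor-2; for Vendor-3 the same arithmetic gives about 0.31 and 0.38, marginally below the reported 0.32 and 0.39, a difference the abstract does not explain.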

MeSH terms

Algorithms
Humans
Lung Neoplasms* / diagnosis
Machine Learning
Retrospective Studies
Tomography, X-Ray Computed / methods

Grants and funding

The author(s) received no specific funding for this work.