Factors Impacting the Performance of Deep Learning Detection of Pulmonary Emboli

J Am Coll Radiol. 2025 Dec 26:S1546-1440(25)00746-X. doi: 10.1016/j.jacr.2025.12.028. Online ahead of print.

Abstract

Objective: AI models are increasingly adopted in clinical practice, yet their generalizability outside controlled validation settings remains unclear. We aimed to evaluate the real-world performance of an FDA-cleared commercial pulmonary embolism (PE) detection model and identify technical, demographic, and clinical factors associated with performance variation, to inform postproduction monitoring and deployment strategies.

Methods: This retrospective study included 11,144 CT pulmonary angiography examinations performed in a single health system between April 2023 and June 2024 and processed by a commercial PE detection model. Technical parameters (scanner manufacturer, slice thickness, volume CT dose index [CTDIvol], contrast enhancement of the pulmonary artery), demographic factors (age, gender, race, body mass index), and clinical comorbidities (heart failure, pulmonary hypertension, cancer) were extracted from DICOM headers and electronic health records. Univariate and multivariable logistic regression analyses were used to identify factors associated with decreased performance.
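As a rough illustration of the univariate step: for a single binary factor, the odds ratio that a logistic regression would report equals the cross-product ratio of a 2x2 table of detected versus missed PEs. The sketch below is not the study's code. The function name and every count in it are hypothetical, and the Woolf (log-scale) interval is an assumed choice.

```python
import math


def odds_ratio_with_ci(exposed_hit, exposed_miss, unexposed_hit, unexposed_miss, z=1.96):
    """Odds ratio of detection (factor present vs absent) with a Woolf CI.

    All four cells must be non-zero for the log-scale standard error to exist.
    """
    odds_ratio = (exposed_hit * unexposed_miss) / (exposed_miss * unexposed_hit)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / exposed_hit + 1 / exposed_miss + 1 / unexposed_hit + 1 / unexposed_miss
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# HYPOTHETICAL counts, for illustration only (not taken from the study):
# PE-positive scans with an imaging artifact: 60 detected, 40 missed
# PE-positive scans without an artifact:      900 detected, 100 missed
or_artifact, ci_low, ci_high = odds_ratio_with_ci(60, 40, 900, 100)
print(f"OR {or_artifact:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An odds ratio below 1 whose interval excludes 1 would indicate lower odds of detection when the factor is present. A multivariable model adjusts each such estimate for the other covariates, which this single-factor table cannot do.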

Results: Of 11,144 examinations, 1,193 (10.7%) were PE-positive. The model had an overall sensitivity of 83.5% (95% confidence interval [CI] 81.3%-85.5%) and a positive predictive value of 90.5% (95% CI 88.7%-92.1%). Multivariable analysis showed significant associations between decreased sensitivity and scanner manufacturer (odds ratio [OR] 0.25, 95% CI 0.14-0.46 and OR 0.34, 95% CI 0.17-0.69, for different vendors versus reference, P < .003), increased slice thickness (OR 0.74, 95% CI 0.57-0.95 per 1-mm increase, P = .018), presence of imaging artifacts (OR 0.33, 95% CI 0.23-0.48, P < .001), heart failure (OR 0.58, 95% CI 0.38-0.88, P = .010), and pulmonary hypertension (OR 0.44, 95% CI 0.25-0.77, P = .004). Demographic factors, including age, gender, race, and body mass index, showed no significant associations with model performance.
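To make the headline metrics concrete, the sketch below computes sensitivity and positive predictive value from counts and attaches a Wilson score interval to each. Only the 1,193 PE-positive cases come from the abstract. The true-positive and flagged-case counts are back-calculated approximations from the reported percentages, and the Wilson method is an assumption, since the abstract does not state how the intervals were derived.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


pe_positive = 1193  # reported in the abstract
true_pos = 996      # ASSUMPTION: back-calculated from 83.5% of 1,193
flagged = 1101      # ASSUMPTION: back-calculated from a 90.5% PPV

sensitivity = true_pos / pe_positive  # TP / (TP + FN)
ppv = true_pos / flagged              # TP / (TP + FP)

sens_lo, sens_hi = wilson_interval(true_pos, pe_positive)
ppv_lo, ppv_hi = wilson_interval(true_pos, flagged)

print(f"Sensitivity {sensitivity:.1%} (95% CI {sens_lo:.1%}-{sens_hi:.1%})")
print(f"PPV {ppv:.1%} (95% CI {ppv_lo:.1%}-{ppv_hi:.1%})")
```

Under these assumed counts, the sensitivity interval should land close to the reported 81.3%-85.5%. The PPV interval is more sensitive to the rounding in the back-calculated denominator. The same function can be applied to subgroup counts (for example, per scanner vendor) when monitoring a deployed model locally.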

Conclusion: AI performance in clinical practice varies significantly based on technical imaging parameters and patient comorbidities. Understanding these factors is essential for optimal product selection and for effective postdeployment monitoring, enabling investigation of model drift in evolving clinical settings. The findings highlight the need for local validation frameworks that account for institution-specific technical infrastructure and patient populations, to ensure safe AI deployment across diverse clinical environments.

Keywords: AI; PE; deep learning; drift; stress-testing.