Evaluating the generalizability of deep learning image classification algorithms to detect middle ear disease using otoscopy

Al-Rahim Habib; Yixi Xu; Kris Bock; Shrestha Mohanty; Tina Sederholm; William B Weeks; Rahul Dodhia; Juan Lavista Ferres; Chris Perry; Raymond Sacks; Narinder Singh

doi:10.1038/s41598-023-31921-0

Sci Rep. 2023 Apr 1;13(1):5368. doi: 10.1038/s41598-023-31921-0.

Authors

Al-Rahim Habib¹ ², Yixi Xu³, Kris Bock⁴, Shrestha Mohanty⁵, Tina Sederholm³, William B Weeks³, Rahul Dodhia³, Juan Lavista Ferres³, Chris Perry⁶, Raymond Sacks⁷, Narinder Singh⁷ ⁸

Affiliations

¹ Faculty of Medicine and Health, University of Sydney, Sydney, NSW, Australia. al-rahim.habib@sydney.edu.au.
² Department of Otolaryngology, Head and Neck Surgery, Westmead Hospital, Sydney, NSW, Australia. al-rahim.habib@sydney.edu.au.
³ AI for Good Lab, Microsoft, Redmond, WA, USA.
⁴ Azure FastTrack Engineering, Brisbane, QLD, Australia.
⁵ Microsoft, Redmond, WA, USA.
⁶ University of Queensland Medical School, Brisbane, QLD, Australia.
⁷ Faculty of Medicine and Health, University of Sydney, Sydney, NSW, Australia.
⁸ Department of Otolaryngology, Head and Neck Surgery, Westmead Hospital, Sydney, NSW, Australia.

Abstract

This study aimed to evaluate the generalizability of artificial intelligence (AI) algorithms that use deep learning methods to identify middle ear disease from otoscopic images, by comparing internal with external performance. A total of 1842 otoscopic images were collected from three independent sources: (a) Van, Turkey, (b) Santiago, Chile, and (c) Ohio, USA. Each image was assigned to one of two diagnostic categories: (i) normal or (ii) abnormal. Deep learning methods were used to develop models, and internal and external performance were evaluated using area under the curve (AUC) estimates. A pooled assessment was performed by combining all cohorts and applying fivefold cross-validation. AI-otoscopy algorithms achieved high internal performance (mean AUC: 0.95, 95% CI: 0.80-1.00). However, performance was reduced when the algorithms were tested on external otoscopic images not used for training (mean AUC: 0.76, 95% CI: 0.61-0.91). Overall, external performance was significantly lower than internal performance (mean difference in AUC: -0.19, p ≤ 0.04). Combining cohorts yielded high pooled performance (AUC: 0.96, standard error: 0.01). Algorithms applied internally performed well in identifying middle ear disease from otoscopic images, but performance was reduced when they were applied to new test cohorts. Further efforts are required to explore data augmentation and pre-processing techniques that might improve external performance and to develop a robust, generalizable algorithm for real-world clinical applications.
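
The internal-versus-external comparison described in the abstract can be illustrated with a short sketch. This is not the authors' code: the function names, the cohort keys, and every number in the example matrix are hypothetical placeholders chosen only to show the arithmetic. The sketch assumes that "internal" means a model tested on held-out images from its own training cohort and "external" means the same model tested on a different cohort, and it computes AUC with the rank-based (Mann-Whitney) formulation.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the probability that a random abnormal image (label 1)
    scores higher than a random normal image (label 0); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def internal_external_gap(auc_matrix):
    """auc_matrix[train_cohort][test_cohort] holds one AUC per pairing.
    Diagonal entries are internal results, off-diagonal entries external.
    Returns (mean internal, mean external, external minus internal)."""
    internal = [auc_matrix[c][c] for c in auc_matrix]
    external = [
        auc_matrix[train][test]
        for train in auc_matrix
        for test in auc_matrix[train]
        if test != train
    ]
    m_int, m_ext = mean(internal), mean(external)
    return m_int, m_ext, m_ext - m_int


# Hypothetical placeholder values, NOT results from the study.
example_matrix = {
    "cohort_a": {"cohort_a": 0.9, "cohort_b": 0.7, "cohort_c": 0.8},
    "cohort_b": {"cohort_a": 0.6, "cohort_b": 1.0, "cohort_c": 0.7},
    "cohort_c": {"cohort_a": 0.8, "cohort_b": 0.9, "cohort_c": 0.8},
}
mean_internal, mean_external, gap = internal_external_gap(example_matrix)
```

A negative gap indicates that performance drops on unseen cohorts, which is the pattern the abstract reports (mean difference in AUC: -0.19). The pooled fivefold cross-validation is a separate analysis and is not covered by this sketch.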

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Artificial Intelligence
Deep Learning*
Ear Diseases* / diagnostic imaging
Humans
Otoscopy / methods