Large chest X-rays (CXR) datasets have been collected to train deep learning models to detect thorax pathology on CXR. However, most CXR datasets are from single-center studies and the collected pathologies are often imbalanced. The aim of this study was to automatically construct a public, weakly-labeled CXR database from articles in PubMed Central Open Access (PMC-OA) and to assess model performance on CXR pathology classification by using this database as additional training data. Our framework includes text extraction, CXR pathology verification, subfigure separation, and image modality classification. We have extensively validated the utility of the automatically generated image database on thoracic disease detection tasks, including Hernia, Lung Lesion, Pneumonia, and pneumothorax. We pick these diseases due to their historically poor performance in existing datasets: the NIH-CXR dataset (112,120 CXR) and the MIMIC-CXR dataset (243,324 CXR). We find that classifiers fine-tuned with additional PMC-CXR extracted by the proposed framework consistently and significantly achieved better performance than those without (e.g., Hernia: 0.9335 vs 0.9154; Lung Lesion: 0.7394 vs. 0.7207; Pneumonia: 0.7074 vs. 0.6709; Pneumothorax 0.8185 vs. 0.7517, all in AUC with p< 0.0001) for CXR pathology detection. In contrast to previous approaches that manually submit the medical images to the repository, our framework can automatically collect figures and their accompanied figure legends. Compared to previous studies, the proposed framework improved subfigure segmentation and incorporates our advanced self-developed NLP technique for CXR pathology verification. We hope it complements existing resources and improves our ability to make biomedical image data findable, accessible, interoperable, and reusable.
Keywords: Artificial intelligence; Chest X-ray; PubMed.
Copyright © 2023 Elsevier Ltd. All rights reserved.