Approach to a preparation of dataset combining digital mammographic images and patient clinical data from electronic medical records

Quant Imaging Med Surg. 2025 Apr 1;15(4):3631-3640. doi: 10.21037/qims-24-1689. Epub 2025 Mar 18.

Abstract

Generating datasets is complex, expensive, and labor-intensive. The process can be optimized by modifying existing datasets for reuse, which is also consistent with the FAIR (Findable, Accessible, Interoperable, Reusable) principles. In this work, we developed a method for enriching a dataset with patients' clinical information from electronic medical records. The proposed approach comprises the following stages: selection of studies with and without signs of the chosen pathology; compilation of a list of clinical signs based on a literature review; extraction of clinical information; and data processing. The method makes it possible to enrich a dataset of radiological studies with clinical parameters, which saves resources and supports further use of the dataset. A limitation of the method is its dependence on the completeness of the clinical information entered into electronic medical records. In the course of this work, a dataset was generated and registered; it includes mammographic images of 200 patients and the following clinical information: the patient's age at the time of the study, age at menopause, and number of births. A statistical analysis of the dataset was carried out. Although the correlation between the studied parameters and the presence of pathology was very weak, statistically significant differences were found between the groups of patients with and without pathology for age at the time of the study, age at menopause, and late menopause. The prepared dataset can be used for scientific research, as well as for training and testing artificial intelligence (AI)-based software that evaluates not only mammographic images but also clinical information.
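
The enrichment and analysis stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`study_id`, `age_at_study`, `age_at_menopause`, `births`), the helper functions, the `LATE_MENOPAUSE_AGE` cut-off, and the toy records are all hypothetical and chosen only for illustration. The abstract does not state which threshold defines late menopause or which correlation measure was used; a point-biserial correlation (Pearson correlation with a binary label) is assumed here as one plausible choice.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Optional, Tuple

# Assumption: illustrative cut-off only; the abstract does not give the threshold.
LATE_MENOPAUSE_AGE = 55


@dataclass
class Record:
    """One study enriched with clinical fields (None when the EMR has no entry)."""
    study_id: str
    has_pathology: bool
    age_at_study: Optional[int] = None
    age_at_menopause: Optional[int] = None
    births: Optional[int] = None


def enrich(studies: List[Tuple[str, bool]],
           emr: Dict[str, Dict[str, int]]) -> List[Record]:
    """Attach clinical fields from EMR rows (keyed by study_id) to each study."""
    records = []
    for study_id, has_pathology in studies:
        clinical = emr.get(study_id, {})  # a missing EMR entry leaves fields as None
        records.append(Record(
            study_id=study_id,
            has_pathology=has_pathology,
            age_at_study=clinical.get("age_at_study"),
            age_at_menopause=clinical.get("age_at_menopause"),
            births=clinical.get("births"),
        ))
    return records


def late_menopause(record: Record) -> Optional[bool]:
    """Derived feature; None when age at menopause was not recorded."""
    if record.age_at_menopause is None:
        return None
    return record.age_at_menopause >= LATE_MENOPAUSE_AGE


def point_biserial(values: List[float], labels: List[int]) -> float:
    """Pearson correlation between a numeric feature and a 0/1 label."""
    n = len(values)
    mean_v = sum(values) / n
    mean_l = sum(labels) / n
    cov = sum((v - mean_v) * (l - mean_l) for v, l in zip(values, labels))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    return cov / sqrt(var_v * var_l)


# Synthetic toy data, not taken from the registered dataset.
studies = [("s1", True), ("s2", False), ("s3", True), ("s4", False)]
emr = {
    "s1": {"age_at_study": 67, "age_at_menopause": 56, "births": 1},
    "s2": {"age_at_study": 52, "age_at_menopause": 49, "births": 2},
    "s3": {"age_at_study": 71, "age_at_menopause": 53, "births": 0},
    # "s4" is deliberately absent to show the dependence on EMR completeness.
}

records = enrich(studies, emr)
complete = [r for r in records if r.age_at_study is not None]
r_age = point_biserial(
    [r.age_at_study for r in complete],
    [int(r.has_pathology) for r in complete],
)
```

Records with missing clinical fields are kept with `None` values rather than dropped at the enrichment step, so that the effect of incomplete medical records stays visible and exclusion can be decided per analysis.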

Keywords: Dataset; artificial intelligence (AI); malignant breast tumors; mammography.