Objective: Medical imaging databases suitable for training machine learning/computer vision algorithms are scarce, limiting the development and generalisation of clinical tools. Clinical trial databases are one potential source, known for their high-quality data and reliable annotations. However, they are not tailored to the needs of machine learning or deep learning models. Our objective was to develop a methodology and tools that enable the curation of these databases specifically for the training or testing of artificial intelligence tools.
Materials and methods: MRIs from the French centres of the EURAD clinical trial (MRI of women with pelvic adnexal lesions) were used to constitute the database. We developed the steps required to curate a clinical trial database: definition of inclusion and exclusion criteria, removal of unnecessary data according to the principle of parsimony, quality control, and harmonisation.
Results: A total of 713 patients were included in our study. The directory structure was simplified, and the numbers of files and folders decreased by 44% and 95%, respectively. Only 62 DICOM fields were considered necessary for artificial intelligence (AI) model applications. Quality control was implemented in repeated cycles of automatic checks, followed by a final manual random inspection. Finally, sequence names were harmonised for easy identification when developing models.
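As an illustration only, two of the curation steps reported above (keeping a restricted set of DICOM fields and harmonising sequence names) could be sketched as follows. This is a minimal sketch under stated assumptions, not the tools shared by the study: headers are represented as plain dictionaries, and the names `KEPT_FIELDS`, `SEQUENCE_PATTERNS`, `filter_fields`, `harmonise_sequence_name` and `curate` are hypothetical. The four fields listed are placeholders for the 62 fields retained in the study, and the pattern-to-name mapping is invented for the example.

```python
import re

# Hypothetical whitelist: the study retained 62 DICOM fields; these four
# names are placeholders chosen for illustration.
KEPT_FIELDS = {"Modality", "SeriesDescription", "PixelSpacing", "SliceThickness"}

# Hypothetical mapping from free-text patterns to harmonised sequence names.
# Patterns are tried in order; the first match wins.
SEQUENCE_PATTERNS = [
    (re.compile(r"t2", re.IGNORECASE), "T2"),
    (re.compile(r"dwi|diff", re.IGNORECASE), "DWI"),
    (re.compile(r"t1", re.IGNORECASE), "T1"),
]


def filter_fields(header):
    """Keep only whitelisted fields (principle of parsimony)."""
    return {key: value for key, value in header.items() if key in KEPT_FIELDS}


def harmonise_sequence_name(description):
    """Map a free-text series description to a harmonised sequence name."""
    for pattern, name in SEQUENCE_PATTERNS:
        if pattern.search(description):
            return name
    # Unmatched descriptions are flagged so they can be reviewed manually.
    return "UNKNOWN"


def curate(header):
    """Apply field filtering, then add a harmonised sequence name."""
    kept = filter_fields(header)
    kept["HarmonisedSequence"] = harmonise_sequence_name(
        kept.get("SeriesDescription", "")
    )
    return kept
```

In such a sketch, descriptions that fall through to `UNKNOWN` would be natural candidates for the repeated automatic checks and the final manual inspection described above.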
Conclusion: Using a clinical trial database, we propose a methodology to build a database suitable for training or testing AI algorithms. This study underlines the need for a more global and systematic framework for the secondary use of health data to develop AI imaging tools for patient care.
Critical relevance statement: We propose and detail a framework and tools to curate a clinical trial database to allow secondary use of the high-quality annotated data generated in clinical trials for the training and testing of artificial intelligence models.
Key points: Clinical trial imaging databases are not adapted for AI model development. A curation process for MRI databases was developed for machine learning applications. We share the open-source tools and methodology developed in this study.
Keywords: Artificial intelligence; Clinical trial; Data curation; MRI; Medical computer vision.
© 2025. The Author(s).