Background: Machine learning techniques hold significant potential to support the diagnosis and prognosis of diseases. However, the success of these approaches depends heavily on rigorous data acquisition, preprocessing, and data organization.
Methods: This article reviews the literature to evaluate key factors in dataset construction, focusing on data structure, preprocessing, and data organization, particularly in the context of imaging data.
Results: The main issues in dataset construction for medical applications are noise (incorrect or irrelevant data), sparsity/limited availability, representativeness/variability, and data imbalance (uneven class distribution). While preprocessing steps make the data suitable for the models, data organization focuses on improving how the data are arranged to increase model performance. Additionally, examining the impact of convolutional neural network (CNN) complexity on the processing of balanced, imbalanced, and complex datasets shows that complex CNNs are not always the optimal choice for every classification problem.
Conclusion: By integrating knowledge from Health Sciences and Biomedical Engineering, we aim to enhance healthcare professionals' understanding of machine learning for image analysis in Oral Medicine and Pathology. This encourages their involvement in patient recruitment and data acquisition, broadening their roles and significantly contributing to the creation of well-characterized datasets for future research and applications.
Copyright © 2025 Elsevier Inc. All rights reserved.