Purpose: This pilot study aims to validate the accuracy and consistency of the "ground truth" annotations for proximal femur fracture classification in a large radiographic image database. The project, a collaboration between expert groups from the University of Turin and the AO Foundation, seeks to ensure that expert consensus-based annotations are reliable for future artificial intelligence (AI) model development.
Methods: A cross-sectional, diagnostic accuracy study was conducted using a randomly selected subset of 300 anteroposterior pelvic radiographs from a single-center image repository created at the University of Turin within the AO Innovation Translation Center framework. Fracture classification annotations were independently provided by the local clinical expert group (LC-EG) and by an independent AO expert group of surgeons (AO-EG). To assess interrater reliability between the two groups, Cohen's kappa coefficient was calculated for categorical agreement on the presence of a fracture and AO/OTA classification.
Results: The comparison of annotations from LC-EG and AO-EG yielded a Cohen's kappa of 0.81 (95 % confidence interval: 0.75-0.87) and a percentage agreement of 87.67 % (95 % confidence interval: 87.63-87.70) for the classification of proximal femur fractures into three defined categories: no fracture, fracture type 31A, and fracture type 31B. These results confirm a high level of consistency between the two expert groups in annotating the image dataset.
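As an illustration only, the two agreement statistics reported above (percentage agreement and Cohen's kappa over three categories) can be computed along the lines of the following minimal Python sketch. The function names, the category labels, and the ten toy annotations are hypothetical and are not taken from the study data; the software and the confidence-interval method actually used by the authors are not stated here.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which both raters assigned the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both raters used a single identical category.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels for ten images (NOT study data).
# Categories mirror the abstract: no fracture, type 31A, type 31B.
lc_eg = ["none", "none", "31A", "31A", "31A", "31B", "31B", "31B", "none", "31A"]
ao_eg = ["none", "none", "31A", "31A", "31B", "31B", "31B", "31A", "none", "31A"]

print(f"Percentage agreement: {percent_agreement(lc_eg, ao_eg):.2%}")
print(f"Cohen's kappa: {cohens_kappa(lc_eg, ao_eg):.3f}")
```

Kappa is lower than raw agreement because it discounts the matches that two raters with the same marginal category frequencies would produce by chance.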
Conclusion: The observed interrater reliability between the LC-EG and AO-EG supports the credibility of the reference annotations, establishing a validated ground truth for proximal femur fractures. This evidence justifies using the radiographic image database as a benchmark for future studies and as a foundation for transparent, reproducible AI development and evaluation, thereby facilitating safer integration of decision support tools into orthopedic trauma workflows.
Keywords: Artificial Intelligence; Diagnostic imaging; Fracture classification; Hip fractures; Interrater reliability; Observer variation; Proximal femur fractures; Radiography.
Copyright © 2026 The Authors. Published by Elsevier Ltd. All rights reserved.