Pilot validation study for a large image database of proximal femur fracture anteroposterior radiographs: Searching for the ground truth

Injury. 2026 Mar;57(3):113056. doi: 10.1016/j.injury.2026.113056. Epub 2026 Jan 22.

Abstract

Purpose: This pilot study aims to validate the accuracy and consistency of the "ground truth" proximal femur fracture classifications in a large radiographic image database. The project, a collaboration between expert groups from the University of Turin and the AO Foundation, seeks to ensure that expert consensus-based annotations are reliable enough for future artificial intelligence (AI) model development.

Methods: A cross-sectional, diagnostic accuracy study was conducted using a randomly selected subset of 300 anteroposterior pelvic radiographs from a single-center image repository created at the University of Turin within the AO Innovation Translation Center framework. Fracture classification annotations were independently provided by the local clinical expert group (LC-EG) and by an independent AO expert group of surgeons (AO-EG). To assess interrater reliability between the two groups, Cohen's kappa coefficient was calculated for categorical agreement on the presence of a fracture and AO/OTA classification.
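As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of unweighted Cohen's kappa for two sets of categorical labels. It is not the authors' analysis code: the function name `cohen_kappa`, the label strings, and the ten toy annotations are all hypothetical and serve only to show how observed and chance agreement combine.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Return (kappa, observed_agreement) for two equal-length label lists."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: share of images given the same label by both groups.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each group's marginal label frequencies,
    # summed over all categories used by either group.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both groups used a single identical label; kappa is undefined,
        # so perfect agreement is reported by convention in this sketch.
        return 1.0, observed
    return (observed - expected) / (1 - expected), observed


# Hypothetical toy annotations (NOT study data): three categories as in the
# abstract, labelled here "none", "31A", "31B".
lc_eg = ["none", "31A", "31A", "31B", "31B", "none", "31A", "31B", "31A", "none"]
ao_eg = ["none", "31A", "31B", "31B", "31B", "none", "31A", "31B", "31A", "31A"]

kappa, agreement = cohen_kappa(lc_eg, ao_eg)
print(f"percent agreement = {agreement:.2%}, kappa = {kappa:.2f}")
```

Kappa is lower than raw percent agreement because it discounts the agreement two raters would reach by chance given how often each uses each category.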

Results: The comparison of annotations from LC-EG and AO-EG yielded a Cohen's kappa of 0.81 (95% confidence interval: 0.75-0.87) and a percentage agreement of 87.67% (95% confidence interval: 87.63-87.70) for the classification of proximal femur fractures into three defined categories: no fracture, fracture type 31A, and fracture type 31B. These results confirm a high level of consistency between the two expert groups in annotating the image dataset.
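The two reported agreement figures can be related through the standard definition of Cohen's kappa. The worked equation below is a rough back-of-the-envelope sketch, not a result stated in the abstract: it assumes the unweighted kappa definition and uses the rounded values, with $p_o$ the observed agreement and $p_e$ the chance agreement. If all 300 radiographs entered the comparison, 87.67% corresponds to 263 of 300 images labelled identically by both groups.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_e = \frac{p_o - \kappa}{1 - \kappa}
\approx \frac{0.8767 - 0.81}{1 - 0.81}
\approx 0.35,
\qquad
p_o = \frac{263}{300} \approx 0.8767
```

Under these assumptions, roughly one third of the raw agreement would be expected by chance alone, which is why kappa (0.81) sits below the percentage agreement.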

Conclusion: The observed interrater reliability between the LC-EG and AO-EG supports the credibility of the reference annotations, establishing a validated ground truth for proximal femur fractures. This evidence justifies using the radiographic image database as a benchmark for future studies and as a foundation for transparent, reproducible AI development and evaluation, thereby facilitating safer integration of decision support tools into orthopedic trauma workflows.

Keywords: Artificial Intelligence; Diagnostic imaging; Fracture classification; Hip fractures; Interrater reliability; Observer variation; Proximal femur fractures; Radiography.

Publication types

  • Validation Study

MeSH terms

  • Artificial Intelligence
  • Cross-Sectional Studies
  • Databases, Factual
  • Female
  • Femoral Fractures* / classification
  • Femoral Fractures* / diagnostic imaging
  • Humans
  • Male
  • Observer Variation
  • Pilot Projects
  • Proximal Femoral Fractures
  • Radiography*
  • Reproducibility of Results