Assessing the performance of ChatGPT-4o on the Turkish Orthopedics and Traumatology Board Examination

Jt Dis Relat Surg. 2025 Apr 5;36(2):304-310. doi: 10.52312/jdrs.2025.1958. Epub 2025 Apr 5.

Abstract

Objectives: This study aims to assess the overall performance of ChatGPT version 4-omni (GPT-4o) on the Turkish Orthopedics and Traumatology Board Examination (TOTBE), using actual examinees as a reference point for comparison with human participants.

Materials and methods: In this study, GPT-4o was tested with the multiple-choice questions that formed the first step of 14 TOTBEs conducted between 2010 and 2023. Image-based questions were assessed separately for all exams. For the five exams held between 2010 and 2014, questions were additionally classified by subspecialty. The performance of GPT-4o was assessed and compared to that of actual examinees of the TOTBE.

Results: The mean total score of GPT-4o was 70.2±5.64 (range, 61 to 84), whereas that of actual examinees was 58±3.28 (range, 53.6 to 64.6). In terms of accuracy, GPT-4o achieved 62% on image-based questions and 70% on text-based questions. It demonstrated superior performance in basic sciences, whereas actual examinees performed better in the specialty of reconstruction. Both GPT-4o and actual examinees scored lowest in the subspecialty of lower extremity and foot.

Conclusion: Our study results showed that GPT-4o performed well on the TOTBE, particularly in basic sciences. Its accuracy was comparable to that of actual examinees in some areas, and these findings highlight its potential as a helpful tool in medical education.

MeSH terms

  • Clinical Competence*
  • Education, Medical, Graduate* / methods
  • Educational Measurement* / methods
  • Generative Artificial Intelligence
  • Humans
  • Orthopedics* / education
  • Specialty Boards*
  • Traumatology* / education
  • Turkey