Can ChatGPT pass the urology fellowship examination? Artificial intelligence capability in surgical training assessment

BJU Int. 2025 Sep;136(3):523-528. doi: 10.1111/bju.16806. Epub 2025 Jun 19.

Abstract

Objectives: To assess the performance of ChatGPT compared to human trainees in the Australian Urology written fellowship examination (essay format).

Materials and methods: Each examination was marked independently by two blinded examining urologists and assessed for: overall pass/failure; proportion of passing questions; and adjusted aggregate score. Examining urologists also made a blinded judgement as to authorship (artificial intelligence [AI] or trainee).

Results: A total of 20 examination papers were marked: 10 completed by urology trainees and 10 by AI platforms (five each on ChatGPT-3.5 and ChatGPT-4.0). Overall, 9/10 trainee examinations passed, compared with 6/10 ChatGPT examinations (P = 0.3). Of the four failing ChatGPT examinations, three were undertaken on the ChatGPT-3.5 platform. The proportion of passing questions per examination was higher in trainees than in ChatGPT: mean 89.4% vs 80.9% (P = 0.2). The adjusted aggregate scores of trainees were also marginally higher than those of ChatGPT: mean 79.2% vs 78.1% (P = 0.8). ChatGPT-3.5 and ChatGPT-4.0 achieved similar aggregate scores (78.9% and 77.4%, P = 0.8), although ChatGPT-3.5 had a slightly lower percentage of passing questions per examination: mean 79.6% vs 82.1% (P = 0.8). Two examinations were misattributed by examining urologists (both were trainee papers perceived to be ChatGPT); therefore, the sensitivity for identifying ChatGPT authorship was 100% and overall accuracy was 91.7%.
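The abstract does not state which statistical test produced the pass-rate P value. As a minimal sketch, assuming a two-sided Fisher exact test on the 2x2 table of passes and failures (trainees 9 pass, 1 fail; ChatGPT 6 pass, 4 fail), the comparison can be reproduced with the Python standard library. The function name fisher_exact_two_sided is hypothetical and chosen for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Assumption: this test is used here for illustration; the abstract
    does not name the test behind its reported P values.
    """
    row1 = a + b          # size of the first group (e.g. trainees)
    col1 = a + c          # total number of passes across both groups
    n = a + b + c + d     # total number of examinations
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability that the first group has k passes
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as observed
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Pass/fail counts taken from the Results: trainees 9/10, ChatGPT 6/10
p_value = fisher_exact_two_sided(9, 1, 6, 4)
print(round(p_value, 2))  # 0.3
```

Under this assumed test the result rounds to 0.3, which agrees with the reported value; with only 10 papers per group, a difference of three passes is well within what chance alone could produce.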

Conclusion: Overall, ChatGPT did not perform as well as human trainees in the Australian Urology fellowship written examination. Examiners were able to identify AI-generated answers with a high degree of accuracy.

Keywords: ChatGPT; artificial intelligence; specialty examination; surgical education and training; urology.

MeSH terms

  • Artificial Intelligence*
  • Australia
  • Clinical Competence*
  • Educational Measurement* / methods
  • Fellowships and Scholarships*
  • Generative Artificial Intelligence
  • Humans
  • Urology* / education