Effectiveness of AI-powered chatbots in responding to orthopaedic postgraduate exam questions: an observational study

Int Orthop. 2024 Apr 15. doi: 10.1007/s00264-024-06182-9. Online ahead of print.

Abstract

Purpose: This study analyses the performance and proficiency of three Artificial Intelligence (AI) generative chatbots (ChatGPT-3.5, ChatGPT-4.0 and Bard Google AI®) in answering the Multiple Choice Questions (MCQs) of postgraduate (PG) level orthopaedic qualifying examinations.

Methods: A series of 120 mock 'Single Best Answer' (SBA) MCQs, each with four possible options labelled A, B, C and D, was compiled on various musculoskeletal (MSK) conditions covering the Trauma and Orthopaedic curricula. A standardised text prompt was used to feed the questions to the ChatGPT (3.5 and 4.0) and Google Bard programs, and their responses were then statistically analysed.
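The abstract does not include the analysis code; the sketch below is a minimal, hypothetical illustration (not the authors' method) of how responses to the 120 SBA MCQs could be scored against an answer key and two chatbots compared with a chi-square test, using the scipy library and placeholder answer lists.

```python
# Illustrative sketch only: hypothetical answer data, not the study's dataset.
from scipy.stats import chi2_contingency

answer_key    = ["A", "C", "B", "D"] * 30   # 120 mock SBA MCQs (hypothetical key)
gpt35_answers = ["A", "C", "D", "D"] * 30   # hypothetical ChatGPT-3.5 responses
gpt4_answers  = ["A", "C", "B", "D"] * 30   # hypothetical ChatGPT-4.0 responses

def score(responses, key):
    """Return (correct, incorrect) counts for one chatbot."""
    correct = sum(r == k for r, k in zip(responses, key))
    return correct, len(key) - correct

# 2x2 contingency table: rows = chatbots, columns = correct/incorrect counts
table = [score(gpt35_answers, answer_key), score(gpt4_answers, answer_key)]
chi2, p, dof, _ = chi2_contingency(table)
print(f"Chi-square = {chi2:.3f}, P = {p:.4f}")
```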

Results: Significant differences were found between the responses of ChatGPT-3.5 and ChatGPT-4.0 (Chi-square = 27.2, P < 0.001), and on comparing both ChatGPT-3.5 (Chi-square = 63.852, P < 0.001) and ChatGPT-4.0 (Chi-square = 44.246, P < 0.001) with Bard Google AI®. Bard Google AI® had 100% efficiency and was significantly more efficient than both ChatGPT-3.5 and ChatGPT-4.0 (P < 0.0001).

Conclusion: The results demonstrate the variable potential of the different AI generative chatbots (ChatGPT-3.5, ChatGPT-4.0 and Bard Google AI®) in their ability to answer the MCQs of PG-level orthopaedic qualifying examinations. Bard Google AI® showed superior performance to both ChatGPT versions, underlining the potential of such large language models for processing and applying orthopaedic subspecialty knowledge at a PG level.

Keywords: Artificial intelligence; Bard; ChatGPT; Chatbots; Medical education; Multiple-choice question; Orthopaedics.