ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination

Med Teach. 2024 Mar;46(3):366-372. doi: 10.1080/0142159X.2023.2249588. Epub 2023 Oct 15.

Abstract

Purpose: ChatGPT-4 is an upgraded version of an artificial intelligence chatbot. The performance of ChatGPT-4 on the United States Medical Licensing Examination (USMLE) has not been independently characterized. We aimed to assess the performance of ChatGPT-4 on USMLE Step 1, Step 2CK, and Step 3 practice questions.

Method: Practice multiple-choice questions for the USMLE Step 1, Step 2CK, and Step 3 were compiled. Of 376 available questions, 319 (85%) were analyzed by ChatGPT-4 on March 21st, 2023. Our primary outcome was the performance of ChatGPT-4 for the practice USMLE Step 1, Step 2CK, and Step 3 examinations, measured as the proportion of multiple-choice questions answered correctly. Our secondary outcomes were the mean length of questions and responses provided by ChatGPT-4.

Results: ChatGPT-4 responded to 319 text-based multiple-choice questions from USMLE practice test material. ChatGPT-4 answered 82 of 93 (88%) questions correctly on USMLE Step 1, 91 of 106 (86%) on Step 2CK, and 108 of 120 (90%) on Step 3. ChatGPT-4 provided explanations for all questions. ChatGPT-4 spent on average 30.8 ± 11.8 s per question on USMLE Step 1, 23.0 ± 9.4 s per question on Step 2CK, and 23.1 ± 8.3 s per question on Step 3. The mean length of practice USMLE multiple-choice questions answered correctly and incorrectly by ChatGPT-4 was similar (difference = 17.48 characters, SE = 59.75, 95% CI [-100.09, 135.04], t = 0.29, p = 0.77). The mean length of ChatGPT-4's correct responses was significantly shorter than that of its incorrect responses (difference = 79.58 characters, SE = 35.42, 95% CI [9.89, 149.28], t = 2.25, p = 0.03).
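As a minimal sketch of how the reported response-length comparison can be sanity-checked, the snippet below recomputes the t statistic from the abstract's difference and standard error. The degrees of freedom are not reported, so the confidence interval uses a normal (z = 1.96) approximation rather than the exact t critical value, which is why the bounds differ slightly from the published [9.89, 149.28].

```python
# Values taken from the abstract (correct vs. incorrect response lengths)
diff = 79.58   # mean difference in response length, characters
se = 35.42     # standard error of the difference

t_stat = diff / se            # t statistic for the difference
ci_low = diff - 1.96 * se     # approximate 95% CI lower bound (z approximation)
ci_high = diff + 1.96 * se    # approximate 95% CI upper bound

print(round(t_stat, 2))                        # ≈ 2.25, matching the reported t
print(round(ci_low, 2), round(ci_high, 2))     # ≈ 10.16 149.0
```

The small gap between this interval and the published one reflects the use of z = 1.96 in place of the (unreported) t critical value.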

Conclusions: ChatGPT-4 answered a remarkably high proportion of practice questions correctly for USMLE examinations. ChatGPT-4 performed substantially better on USMLE practice questions than earlier versions of the same AI chatbot.

Keywords: United States medical licensing examination; artificial intelligence; chatgpt-4; natural language processing.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence*
  • Humans
  • Licensure
  • Physical Examination
  • Software*