Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine

Qiao Jin; Fangyuan Chen; Yiliang Zhou; Ziyang Xu; Justin M Cheung; Robert Chen; Ronald M Summers; Justin F Rousseau; Peiyun Ni; Marc J Landsman; Sally L Baxter; Subhi J Al'Aref; Yijia Li; Alex Chen; Josef A Brejt; Michael F Chiang; Yifan Peng; Zhiyong Lu

ArXiv [Preprint]. 2024 Apr 22:arXiv:2401.08396v3.

Authors

Qiao Jin¹, Fangyuan Chen², Yiliang Zhou³, Ziyang Xu⁴, Justin M Cheung⁵, Robert Chen⁶, Ronald M Summers⁷, Justin F Rousseau⁸, Peiyun Ni⁹, Marc J Landsman¹⁰, Sally L Baxter¹¹, Subhi J Al'Aref¹², Yijia Li¹³, Alex Chen¹⁴, Josef A Brejt¹⁴, Michael F Chiang¹⁵, Yifan Peng³, Zhiyong Lu¹

Affiliations

¹ National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
² University of Pittsburgh, Pittsburgh, PA, USA.
³ Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
⁴ Ronald O. Perelman Department of Dermatology, New York University Grossman School of Medicine, New York City, NY, USA.
⁵ Department of Medicine, Harvard Medical School and Massachusetts General Hospital, Boston, MA, USA.
⁶ Pathology & Laboratory Medicine, Weill Cornell Medicine, New York, NY, USA.
⁷ Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Department of Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD, USA.
⁸ Department of Neurology, Peter O'Donnell Jr. Brain Institute, UT Southwestern Medical Center, Dallas, TX, USA.
⁹ Division of Gastroenterology, Department of Medicine, Harvard Medical School and Massachusetts General Hospital, Boston, MA, USA.
¹⁰ Division of Gastroenterology, Department of Medicine, Metrohealth Medical Center, Cleveland, OH, USA. Case Western Reserve University School of Medicine, Cleveland, OH, USA.
¹¹ Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, La Jolla, CA, USA.
¹² Division of Cardiology, Department of Internal Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA.
¹³ University of Pittsburgh Medical Center, Pittsburgh, PA, USA.
¹⁴ Department of Internal Medicine, Weill Cornell Medicine, New York, NY, USA.
¹⁵ National Eye Institute, National Institutes of Health, Bethesda, MD, USA.

PMID: 38410646
PMCID: PMC10896362

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on the accuracy of multiple-choice answers alone. Our study extends the current scope by comprehensively analyzing GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases that physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the need for further in-depth evaluation of its rationales before such multimodal AI models are integrated into clinical workflows.

Publication types

Preprint

Grants and funding

R01 LM014344/LM/NLM NIH HHS/United States