Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries

Krithi Pushpanathan; Zhi Wei Lim; Samantha Min Er Yew; David Ziyou Chen; Hazel Anne Hui'En Lin; Jocelyn Hui Lin Goh; Wendy Meihua Wong; Xiaofei Wang; Marcus Chun Jin Tan; Victor Teck Chang Koh; Yih-Chung Tham

doi:10.1016/j.isci.2023.108163

iScience. 2023 Oct 10;26(11):108163. doi: 10.1016/j.isci.2023.108163. eCollection 2023 Nov 17.

Authors

Krithi Pushpanathan^{1,2}, Zhi Wei Lim¹, Samantha Min Er Yew^{1,2}, David Ziyou Chen^{1,2,3}, Hazel Anne Hui'En Lin^{1,2,3}, Jocelyn Hui Lin Goh⁴, Wendy Meihua Wong^{1,2,3}, Xiaofei Wang^{5,6}, Marcus Chun Jin Tan^{1,2,3}, Victor Teck Chang Koh^{1,2,3}, Yih-Chung Tham^{1,2,4,7}

Affiliations

¹ Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
² Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.
³ Department of Ophthalmology, National University Hospital, Singapore, Singapore.
⁴ Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore.
⁵ Key Laboratory for Biomechanics and Mechanobiology of Ministry of Education, Beijing, China.
⁶ Advanced Innovation Centre for Biomedical Engineering, School of Biological Science and Medical Engineering, Beihang University, Beijing, China.
⁷ Ophthalmology and Visual Sciences Academic Clinical Programme (Eye ACP), Duke NUS Medical School, Singapore, Singapore.

Abstract

In light of growing interest in using emerging large language models (LLMs) for self-diagnosis, we systematically assessed the performance of ChatGPT-3.5, ChatGPT-4.0, and Google Bard in delivering proficient responses to 37 common inquiries regarding ocular symptoms. Responses were masked, randomly shuffled, and then graded by three consultant-level ophthalmologists for accuracy (poor, borderline, good) and comprehensiveness. Additionally, we evaluated the self-awareness capabilities (ability to self-check and self-correct) of the LLM-Chatbots. ChatGPT-4.0 significantly outperformed both other chatbots: 89.2% of its responses were rated 'good', compared with 59.5% for ChatGPT-3.5 and 40.5% for Google Bard (all p < 0.001). All three LLM-Chatbots also showed optimal mean comprehensiveness scores (ranging from 4.6 to 4.7 out of 5). However, they exhibited subpar to moderate self-awareness capabilities. Our study underscores the potential of ChatGPT-4.0 in delivering accurate and comprehensive responses to ocular symptom inquiries. Future rigorous validation of the LLM-Chatbots' performance is crucial to ensure their reliability and appropriateness for actual clinical use.

Keywords: Artificial intelligence; Ophthalmology.