Performance of ChatGPT vs. HuggingChat on OB-GYN Topics

Cureus. 2024 Mar 14;16(3):e56187. doi: 10.7759/cureus.56187. eCollection 2024 Mar.

Abstract

Background: While large language models show potential as beneficial tools in medicine, their reliability, especially in the realm of obstetrics and gynecology (OB-GYN), is not fully understood. This study aims to measure and compare the performance of ChatGPT and HuggingChat in answering OB-GYN-related medical examination questions, offering insight into their effectiveness in this specialized field.

Methods: ChatGPT and HuggingChat were given two standardized multiple-choice question banks: Test 1, developed by the National Board of Medical Examiners (NBME), and Test 2, drawn from the Association of Professors of Gynecology & Obstetrics (APGO) Web-Based Interactive Self-Evaluation (uWISE). Responses were analyzed and compared for correctness.

Results: The two-proportion z-test revealed no statistically significant difference in performance between ChatGPT and HuggingChat on either medical examination. For Test 1, ChatGPT scored 90%, while HuggingChat scored 85% (p = 0.6). For Test 2, ChatGPT correctly answered 70% of questions, while HuggingChat correctly answered 62% (p = 0.4).

Conclusion: Awareness of the strengths and weaknesses of artificial intelligence enables its proper and effective use. Our findings indicate no statistically significant difference in performance between ChatGPT and HuggingChat in addressing medical inquiries. Nonetheless, both platforms demonstrate considerable promise for applications within the medical domain.
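As a minimal sketch of the comparison described above, the two-proportion z-test can be reproduced with statsmodels. The abstract reports only percentage scores, not the number of questions per test, so the counts below are hypothetical placeholders used purely for illustration.

# Minimal sketch of a two-proportion z-test comparing two models' scores
# on the same question bank. Question counts are hypothetical placeholders;
# the abstract gives only percentages (Test 1: 90% vs. 85%; Test 2: 70% vs. 62%).
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def compare_scores(correct_a, correct_b, n_questions):
    """Two-sided two-proportion z-test for two models on one question bank."""
    counts = np.array([correct_a, correct_b])
    nobs = np.array([n_questions, n_questions])
    z_stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')
    return z_stat, p_value

# Hypothetical example: a 20-question bank where one model answers 18 correctly
# (90%) and the other 17 (85%).
z, p = compare_scores(18, 17, 20)
print(f"z = {z:.2f}, p = {p:.2f}")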

Keywords: artificial intelligence in medicine; chatgpt 3.5; gynecology and obstetrics; medical student assessment; nbme examinations.