Performance of Artificial Intelligence (AI)-Powered Chatbots in the Assessment of Medical Case Reports: Qualitative Insights From Simulated Scenarios

Cureus. 2024 Feb 9;16(2):e53899. doi: 10.7759/cureus.53899. eCollection 2024 Feb.

Abstract

Introduction With the expanding awareness and use of AI-powered chatbots, it seems likely that an increasing number of people will use them to assess and evaluate their medical symptoms. If chatbots that have not undergone a thorough medical evaluation for this specific use are employed for this purpose, various risks might arise. The aim of this study is to analyze and compare the performance of popular chatbots in differentiating between severe and less critical medical symptoms described from a patient's perspective, and to examine variations among the chatbots' responses in substantive medical assessment accuracy and empathetic communication style. Materials and methods Our study compared three different AI-supported chatbots: OpenAI's ChatGPT 3.5, Microsoft's Bing Chat, and Inflection's Pi AI. Three exemplary case reports describing medical emergencies, as well as three cases without an urgent indication for emergency medical admission, were constructed and analyzed. Each case report was accompanied by identical questions concerning the most likely suspected diagnosis and the urgency of an immediate medical evaluation. The chatbots' respective answers were qualitatively compared regarding the medical accuracy of the differential diagnoses mentioned and the conclusions drawn, as well as regarding patient-oriented and empathetic language. Results All examined chatbots were capable of providing medically plausible and probable diagnoses and of classifying situations as acute or less critical. However, their responses varied slightly in their urgency assessments. Clear differences emerged in the level of detail of the differential diagnoses, the overall length of the answers, and how each chatbot handled being confronted with medical issues. All answers were comparable in terms of empathy and comprehensibility.
Conclusion Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical medical emergency indications, but their answers should always be accompanied by a disclaimer. In responding to medical queries, characteristic differences emerge among chatbots in the extent and style of their respective answers. Given the lack of medical supervision of many established chatbots, further studies and practical experience are essential to clarify whether more extensive use of these chatbots for medical concerns will have a positive impact on healthcare or rather pose major medical risks.

Keywords: ai chatbots; ai in healthcare; artificial intelligence; chatgpt; symptom checker.