Utility and Comparative Performance of Current Artificial Intelligence Large Language Models as Postoperative Medical Support Chatbots in Aesthetic Surgery

Aesthet Surg J. 2024 Feb 6:sjae025. doi: 10.1093/asj/sjae025. Online ahead of print.

Abstract

Background: Large Language Models (LLMs) have revolutionized the way plastic surgeons and their patients can access and leverage artificial intelligence (AI).

Objectives: The present study aims to comparatively assess the performance of two current publicly available, patient-accessible LLMs in the potential application of AI as postoperative medical support chatbots in an aesthetic surgeon's practice.

Methods: Twenty-two simulated postoperative patient presentations following aesthetic breast plastic surgery were devised and expert-validated. Complications varied in their latency within the postoperative period, as well as in the urgency of required medical attention. In response to each patient-reported presentation, OpenAI's ChatGPT and Google's Bard, in their unmodified and freely available versions, were objectively assessed for their comparative accuracy in generating an appropriate differential diagnosis, most likely diagnosis, suggested medical disposition, treatments or interventions to begin from home, and/or red flag signs/symptoms indicating deterioration.
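For illustration only, the sketch below shows how a simulated patient presentation might be posed programmatically to an LLM via OpenAI's chat completions API. The study itself used the unmodified, freely available chat interfaces of ChatGPT and Bard; the prompt text, model name, and grading step shown here are hypothetical assumptions rather than the authors' protocol.

# Hypothetical sketch: posing a simulated postoperative presentation to an LLM.
# Assumes the openai Python package (>=1.0) and an OPENAI_API_KEY in the environment;
# the study itself used the free web chat interfaces, not the API.
from openai import OpenAI

client = OpenAI()

presentation = (
    "I am 5 days post bilateral breast augmentation. My right breast is suddenly "
    "much more swollen, tight, and painful than the left. What could this be, "
    "and what should I do?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; any available chat model
    messages=[
        {"role": "system", "content": "You are a postoperative medical support chatbot "
                                      "for aesthetic breast surgery patients."},
        {"role": "user", "content": presentation},
    ],
)

print(response.choices[0].message.content)
# The reply would then be graded against expert-validated criteria: differential diagnosis,
# most likely diagnosis, disposition, home interventions, and red flag signs/symptoms.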

Results: ChatGPT cumulatively and significantly outperformed Bard across all objective assessment metrics examined (66% vs. 55%, respectively; p < 0.05). Accuracy in generating an appropriate differential diagnosis was 61% for ChatGPT and 57% for Bard (p = 0.45). ChatGPT asked an average of 9.2 questions on history, relative to 6.8 questions by Bard (p < 0.001), after which accuracies of 91% and 68%, respectively, in arriving at the most likely diagnosis were noted (p < 0.01). Appropriate medical dispositions were suggested with an accuracy of 50% by ChatGPT and 41% by Bard (p = 0.40); relevant home interventions/treatments with accuracies of 59% and 55% (p = 0.94); and red flag signs/symptoms with accuracies of 79% and 54% (p < 0.01), respectively. Detailed and comparative performance breakdowns according to complication latency and urgency are presented herein.
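The abstract does not specify which statistical test was applied. Purely as an illustration, the sketch below compares two accuracy proportions with Fisher's exact test on placeholder counts; because the same 22 presentations were posed to both models, a paired test such as McNemar's may in practice be more appropriate.

# Hypothetical sketch: comparing two accuracy proportions with Fisher's exact test.
# Counts are illustrative placeholders, not the study's raw data.
from scipy.stats import fisher_exact

chatgpt_correct, chatgpt_total = 20, 22  # illustrative counts out of 22 presentations
bard_correct, bard_total = 15, 22        # illustrative counts out of 22 presentations

table = [
    [chatgpt_correct, chatgpt_total - chatgpt_correct],
    [bard_correct, bard_total - bard_correct],
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher's exact test: OR = {odds_ratio:.2f}, p = {p_value:.3f}")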

Conclusions: ChatGPT represents the superior LLM for the potential application of AI technology in postoperative medical support chatbots. The imperfect performance and limitations identified herein may guide the refinement necessary to facilitate adoption.