Frequently asked questions on erectile dysfunction: evaluating artificial intelligence answers with expert mentorship

Muharrem Baturu; Mehmet Solakhan; Tanyeli Guneyligil Kazaz; Omer Bayrak

doi:10.1038/s41443-024-00898-3

Int J Impot Res. 2024 May 7. doi: 10.1038/s41443-024-00898-3. Online ahead of print.

Authors

Muharrem Baturu¹, Mehmet Solakhan², Tanyeli Guneyligil Kazaz³, Omer Bayrak⁴

Affiliations

¹ Department of Urology, University of Gaziantep, Gaziantep, Turkey.
² Department of Urology, Hasan Kalyoncu University, Gaziantep, Turkey.
³ Department of Biostatistics, University of Gaziantep, Gaziantep, Turkey.
⁴ Department of Urology, University of Gaziantep, Gaziantep, Turkey. dromerbayrak@yahoo.com.

PMID: 38714784
DOI: 10.1038/s41443-024-00898-3

Abstract

The present study assessed the accuracy of artificial intelligence-generated responses to frequently asked questions on erectile dysfunction. In this cross-sectional analysis, 56 erectile dysfunction-related questions searched on Google were categorized into nine sections: causes, diagnosis, treatment options, treatment complications, protective measures, relationship with other illnesses, treatment costs, treatment with herbal agents, and appointments. Responses from ChatGPT 3.5, ChatGPT 4, and BARD were evaluated by two experienced urology experts using the F1 score and the global quality score (GQS) for accuracy, relevance, and comprehensibility. ChatGPT 3.5 and ChatGPT 4 achieved higher GQS than BARD in categories such as causes (4.5 ± 0.54, 4.5 ± 0.51, 3.15 ± 1.01, respectively, p < 0.001), treatment options (4.35 ± 0.6, 4.5 ± 0.43, 2.71 ± 1.38, respectively, p < 0.001), protective measures (5.0 ± 0, 5.0 ± 0, 4 ± 0.5, respectively, p = 0.013), relationship with other illnesses (4.58 ± 0.58, 4.83 ± 0.25, 3.58 ± 0.8, respectively, p = 0.006), and treatment with herbal agents (3 ± 0.61, 3.33 ± 0.83, 1.8 ± 1.09, respectively, p = 0.043). The F1 scores for the categories of causes (1), diagnosis (0.857), treatment options (0.726), and protective measures (1) indicated that the responses aligned with the guidelines. There was no significant difference between ChatGPT 3.5 and ChatGPT 4 in answer quality, but both outperformed BARD in the GQS. These results emphasize the need to continually enhance and validate AI-generated medical information, underscoring the importance of artificial intelligence systems in delivering reliable information on erectile dysfunction.
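For readers unfamiliar with the F1 score cited above, the following is a minimal sketch of the standard definition (the harmonic mean of precision and recall). The abstract does not describe how the authors counted guideline-concordant and non-concordant statements, so the function name `f1_score` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not data from the study.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1 score: harmonic mean of precision and recall.

    Assumption (not from the abstract): the inputs are counts of items
    judged correct and present (true_pos), present but incorrect
    (false_pos), and expected but missing (false_neg).
    """
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        # No items at all: return 0.0 by convention to avoid dividing by zero.
        return 0.0
    return 2 * true_pos / denominator


# Hypothetical counts, for illustration only (not taken from the study).
perfect = f1_score(true_pos=5, false_pos=0, false_neg=0)   # no errors, no omissions
partial = f1_score(true_pos=6, false_pos=1, false_neg=1)   # one error, one omission
print(f"{perfect:.3f} {partial:.3f}")
```

An F1 of 1, as reported for the causes and protective measures categories, corresponds to the case with no false positives and no false negatives; values below 1 reflect some mix of incorrect and missing items.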