neuroGPT-X: toward a clinic-ready large language model

Edward Guo; Mehul Gupta; Sarthak Sinha; Karl Rössler; Marcos Tatagiba; Ryojo Akagami; Ossama Al-Mefty; Taku Sugiyama; Philip E Stieg; Gwynedd E Pickett; Madeleine de Lotbiniere-Bassett; Rahul Singh; Sanju Lama; Garnette R Sutherland

doi:10.3171/2023.7.JNS23573

J Neurosurg. 2023 Oct 6;140(4):1041-1053. doi: 10.3171/2023.7.JNS23573. Print 2024 Apr 1.

Authors

Edward Guo¹ ², Mehul Gupta¹, Sarthak Sinha¹, Karl Rössler³, Marcos Tatagiba⁴, Ryojo Akagami⁵, Ossama Al-Mefty⁶, Taku Sugiyama⁷, Philip E Stieg⁸, Gwynedd E Pickett⁹, Madeleine de Lotbiniere-Bassett¹ ², Rahul Singh¹ ², Sanju Lama¹ ², Garnette R Sutherland¹ ²

Affiliations

¹ Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada.
² Department of Clinical Neurosciences, Project neuroArm, Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, Canada.
³ Department of Neurosurgery, Medical University of Vienna, Vienna, Austria.
⁴ Department of Neurosurgery, Tubingen University, Tubingen, Germany.
⁵ Department of Surgery, University of British Columbia, Vancouver, British Columbia, Canada.
⁶ Department of Neurosurgery, Harvard School of Medicine, Boston, Massachusetts.
⁷ Department of Neurosurgery, Hokkaido University Graduate School of Medicine, Sapporo, Japan.
⁸ Department of Neurosurgery, Weill Cornell Medicine/NewYork-Presbyterian Hospital, New York, New York.
⁹ Department of Surgery, Dalhousie University, Halifax, Nova Scotia, Canada.

PMID: 38564804
DOI: 10.3171/2023.7.JNS23573

Abstract

Objective: The primary objective was to assess the performance of a context-enriched large language model (LLM) compared with international neurosurgical experts on questions related to the management of vestibular schwannoma. A second objective was to develop a chat-based platform incorporating in-text citations, references, and memory to enable accurate, relevant, and reliable information in real time.
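The abstract does not describe how the platform combines citations, references, and memory. As an illustration only, the sketch below shows one common way such a prompt could be assembled: retrieved passages are numbered so that an answer can cite them in-text, a matching reference list is returned, and earlier questions are carried forward as memory. Every name here (Passage, ChatSession, retrieve, build_prompt) is hypothetical, the word-overlap ranking is a stand-in for a real retrieval method, and none of it should be read as the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    source: str  # hypothetical reference label, e.g. a citation string
    text: str    # passage text taken from the scraped data set


def retrieve(question, corpus, k=2):
    """Rank passages by naive word overlap with the question (illustrative only)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_words & set(p.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


@dataclass
class ChatSession:
    corpus: list
    history: list = field(default_factory=list)  # earlier user questions ("memory")

    def build_prompt(self, question, k=2):
        """Return (prompt, references) with numbered context for in-text citation."""
        passages = retrieve(question, self.corpus, k)
        context = "\n".join(f"[{i}] {p.text}" for i, p in enumerate(passages, 1))
        references = "\n".join(f"[{i}] {p.source}" for i, p in enumerate(passages, 1))
        memory = "\n".join(self.history)
        prompt = (
            "Answer using only the numbered context and cite it as [n].\n"
            f"Conversation so far:\n{memory}\n"
            f"Context:\n{context}\n"
            f"Question: {question}"
        )
        # Remember this question so that later prompts include it.
        self.history.append(f"User: {question}")
        return prompt, references
```

The prompt string would then be passed to whichever LLM is in use; that call is deliberately left out because the abstract does not specify it.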

Methods: The analysis involved 1) creating a data set through web scraping, 2) developing a chat-based platform called neuroGPT-X, 3) enlisting 8 expert neurosurgeons across international centers to independently create questions (n = 1), answer them (n = 4), and evaluate responses (n = 3) while blinded, and 4) analyzing the evaluation results on the management of vestibular schwannoma. In the blinded phase, all answers were assessed for accuracy, coherence, relevance, thoroughness, speed, and overall rating. In the unblinded phase, all neurosurgeons completed a Likert scale survey and long-answer questions regarding the clinical utility, likelihood of use, and limitations of the tool. The tool was then evaluated on the basis of a set of 103 consensus statements on vestibular schwannoma care from the 8th Quadrennial International Conference on Vestibular Schwannoma.

Results: Ratings of responses from the naive and context-enriched Generative Pretrained Transformer (GPT) models did not differ significantly in terms of accuracy, coherence, relevance, thoroughness, and overall performance, and both models were often rated significantly higher than expert responses. Both the naive and context-enriched GPT models provided faster responses to the standardized question set than expert neurosurgeon respondents (p < 0.01). The context-enriched GPT model agreed with 98 of the 103 (95%) consensus statements. Of interest, all expert surgeons expressed concerns about the reliability of GPT in accurately addressing the nuances and controversies surrounding the management of vestibular schwannoma. Furthermore, the authors developed neuroGPT-X, a chat-based platform designed to provide point-of-care clinical support and mitigate the limitations of human memory. neuroGPT-X incorporates features such as in-text citations and references to enable accurate, relevant, and reliable information in real time.
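The reported agreement of 98 of 103 consensus statements can be checked with simple arithmetic. The sketch below reproduces the rounded 95% figure and, as an addition that is not part of the abstract, shows how a Wilson score interval could be attached to a proportion of this kind. The function names and the choice of a 95% interval (z = 1.96) are assumptions made for illustration; the abstract itself reports no interval.

```python
from math import sqrt


def agreement_rate(agreed, total):
    """Fraction of consensus statements the model agreed with."""
    return agreed / total


def wilson_interval(agreed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = agreed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


rate = agreement_rate(98, 103)
print(f"{rate:.1%}")  # 98/103 rounds to the 95% stated in the abstract
low, high = wilson_interval(98, 103)
print(f"Illustrative 95% Wilson interval: {low:.3f} to {high:.3f}")
```

The interval is included only to show how uncertainty around a 103-item sample might be expressed; it is not a result from the study.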

Conclusions: The subspecialist-level performance observed in the present study in generating written responses to complex neurosurgical problems, for which evidence-based consensus for management is lacking, suggests that context-enriched LLMs show promise as a point-of-care medical resource. The authors anticipate that this work will be a springboard for expansion into more medical specialties, incorporating evidence-based clinical information and developing expert-level dialogue surrounding LLMs in healthcare.

Keywords: GPT; acoustic schwannoma; large language models; neuroGPT-X; vestibular schwannoma.

MeSH terms

Humans
Language
Medicine*
Neuroma, Acoustic* / surgery
Neurosurgeons
Reproducibility of Results