A large language model's assessment of methodology reporting in head and neck surgery

Am J Otolaryngol. 2024 Mar-Apr;45(2):104145. doi: 10.1016/j.amjoto.2023.104145. Epub 2023 Dec 6.

Abstract

Objective: The aim of this study was to assess the ability of a large language model (ChatGPT 3.5) to appraise the quality of scientific methodology reporting in head and neck-specific scientific literature.

Methods: The authors asked ChatGPT 3.5 to create a grading system for the scientific reporting of research methods. The language model produced a system with a maximum of 60 points, with individual scores for Study Design and Description, Data Collection and Measurement, Statistical Analysis, Ethical Considerations, and Overall Clarity and Transparency. Twenty articles were selected at random from The American Head and Neck Society's (AHNS) fellowship curriculum 2.0, and each 'Methods' section was input into ChatGPT 3.5 for scoring. Analysis of variance (ANOVA) was performed across the scoring categories, followed by a post-hoc Tukey HSD test.

Results: Twenty articles were assessed; eight were categorized as very good and nine as good based on cumulative score. The lowest mean score was observed in the statistical analysis category (Mean = 0.49, SD = 0.02). ANOVA showed a significant difference between the means of the scoring categories, F(4, 95) = 13.4, p ≤ 0.05. On post-hoc Tukey HSD testing, mean scores for the data collection (Mean = 0.58, SD = 0.06) and statistical analysis (Mean = 0.49, SD = 0.02) categories were significantly lower than those of the other categories.
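The statistical comparison reported above (one-way ANOVA across the five scoring categories, followed by Tukey's HSD) can be sketched in Python. The scores below are synthetic illustrations, not the study's data; the category names come from the abstract, and the means loosely echo the reported pattern:

```python
# Minimal sketch, assuming normalized per-article scores (20 articles per
# category). The data here are simulated for illustration only.
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

rng = np.random.default_rng(42)
categories = ["Study Design", "Data Collection", "Statistical Analysis",
              "Ethical Considerations", "Clarity and Transparency"]
# Hypothetical category means loosely echoing the reported pattern
means = [0.65, 0.58, 0.49, 0.68, 0.66]
samples = [np.clip(rng.normal(m, 0.05, 20), 0, 1) for m in means]

# One-way ANOVA across the five categories (df = 4, 95 for 5 x 20 scores)
f_stat, p_val = f_oneway(*samples)
print(f"ANOVA: F(4, 95) = {f_stat:.1f}, p = {p_val:.2g}")

# Post-hoc Tukey HSD: res.pvalue[i, j] tests category i vs category j
res = tukey_hsd(*samples)
low = categories.index("Statistical Analysis")
print("Statistical Analysis vs others, Tukey p-values:",
      np.round(np.delete(res.pvalue[low], low), 4))
```

With clearly separated group means, the ANOVA is significant and the Tukey comparisons flag the lowest-scoring categories, mirroring the structure (though not the exact values) of the analysis described in the abstract.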

Conclusion: This article demonstrates the feasibility of employing a large language model such as ChatGPT 3.5 to assess the methods sections of head and neck academic writing.

Keywords: AI; ChatGPT; Head and neck; Large language model.

MeSH terms

  • Analysis of Variance
  • Curriculum*
  • Head
  • Humans
  • Language
  • Research Design*