Assessing the Efficacy of Large Language Models in Health Literacy: A Comprehensive Cross-Sectional Study

Kanhai S Amin; Linda C Mayes; Pavan Khosla; Rushabh H Doshi

doi:10.59249/ZTOZ1966

Yale J Biol Med. 2024 Mar 29;97(1):17-27. doi: 10.59249/ZTOZ1966. eCollection 2024 Mar.

Authors

Kanhai S Amin¹, Linda C Mayes², Pavan Khosla³, Rushabh H Doshi³

Affiliations

¹ Yale College, New Haven, CT, USA.
² Yale Child Study Center, Yale School of Medicine, New Haven, CT, USA.
³ Yale School of Medicine, New Haven, CT, USA.

Abstract

Enhanced health literacy in children has been empirically linked to better long-term health outcomes; however, few interventions have been shown to improve health literacy. In this context, we investigate whether large language models (LLMs) can serve as a medium to improve health literacy in children. We tested pediatric conditions using 26 different prompts in ChatGPT-3.5, ChatGPT-4, Microsoft Bing, and Google Bard (now known as Google Gemini). The primary outcome was the reading grade level (RGL) of the output, as assessed by the Gunning Fog, Flesch-Kincaid Grade Level, Automated Readability Index, and Coleman-Liau indices. Word counts were also assessed. Across all models, output for basic prompts such as "Explain" and "What is (are)" was at or above the tenth-grade RGL. When prompts asked for explanations at the first- through twelfth-grade levels, the LLMs varied in their ability to tailor responses to the requested grade level. ChatGPT-3.5 provided responses that ranged from the seventh-grade to the college freshman RGL, while ChatGPT-4 produced responses from the tenth-grade to the college senior RGL. Microsoft Bing provided responses from the ninth- to eleventh-grade RGL, while Google Bard provided responses from the seventh- to tenth-grade RGL. LLMs face challenges in crafting outputs below a sixth-grade RGL. However, their capability to modify outputs above this threshold provides a potential mechanism for adolescents to explore, understand, and engage with information regarding their health conditions, in terms ranging from simple to complex. Future studies are needed to verify the accuracy and efficacy of these tools.
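
For readers unfamiliar with the readability indices named above, the following is a minimal Python sketch of two of them, the Automated Readability Index and the Coleman-Liau index, using their standard published formulas. It is illustrative only: the abstract does not state which software the authors used, and the function names, the regex-based tokenization, and the punctuation-based sentence counting are all assumptions introduced here. Gunning Fog and Flesch-Kincaid are omitted because they additionally require syllable counting.

```python
import re


def text_counts(text):
    """Return (letters, words, sentences) using simple heuristics.

    Assumption: words are runs of letters, digits, or apostrophes, and a
    sentence ends at each run of '.', '!' or '?'. Abbreviations such as
    "e.g." and decimals such as "3.5" will therefore inflate the sentence
    count; dedicated tools handle these cases more carefully.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        raise ValueError("text contains no words")
    # Count only alphanumeric characters, so apostrophes are excluded.
    letters = sum(ch.isalnum() for word in words for ch in word)
    # Treat text with no terminal punctuation as a single sentence.
    sentences = len(re.findall(r"[.!?]+", text)) or 1
    return letters, len(words), sentences


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43"""
    letters, words, sentences = text_counts(text)
    return 4.71 * letters / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8

    L is the number of letters per 100 words and S is the number of
    sentences per 100 words.
    """
    letters, words, sentences = text_counts(text)
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```

Both functions return an approximate United States school grade level, so a higher value indicates text that is harder to read. Passing a saved model response as `text` gives a rough RGL estimate that can be compared across prompts, although the heuristic counts mean the values may differ from those produced by established readability calculators.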

Keywords: Artificial Intelligence; ChatGPT; Google Bard; Google Gemini; Health Literacy; Large Language Models; Microsoft Bing; Pediatrics; Reading Grade Level.

MeSH terms

Adolescent
Child
Comprehension
Cross-Sectional Studies
Health Literacy*
Humans
Language
Reading