Purpose: This study aimed to evaluate the content validity and inter-rater reliability of stuttering assessment and intervention programs generated by artificial intelligence (GPT-4) in both Turkish and English for preschool, school-age, and adult populations. It also examined whether linguistic or cultural differences affected expert evaluations.
Methods: Twelve AI-generated programs (six in Turkish, six in English) were reviewed by twelve certified speech-language pathologists specializing in fluency disorders. Each item was rated on a 5-point Likert scale. Descriptive statistics, Cronbach's alpha, and intraclass correlation coefficients (ICC) were calculated to assess consistency and reliability.
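For readers unfamiliar with these statistics, the following is a minimal sketch of how Cronbach's alpha and single- and average-measures ICC can be computed from an items-by-raters matrix. It assumes a two-way random-effects, absolute-agreement model (ICC(2,1) and ICC(2,k)); the abstract does not state which ICC form was used. The function names and the toy ratings are hypothetical and are not the study's data.

```python
import numpy as np


def cronbach_alpha(ratings):
    """Cronbach's alpha with raters (columns) treated as the scale components."""
    x = np.asarray(ratings, dtype=float)
    k = x.shape[1]
    sum_col_var = x.var(axis=0, ddof=1).sum()   # variance of each rater, summed
    total_var = x.sum(axis=1).var(ddof=1)       # variance of per-item totals
    return k / (k - 1) * (1 - sum_col_var / total_var)


def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) from a two-way ANOVA decomposition.

    Assumed model: two-way random effects, absolute agreement.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape                              # n items, k raters
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                     # between-items mean square
    msc = ss_cols / (k - 1)                     # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))          # residual mean square
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Hypothetical 5-point Likert ratings: 5 items (rows) x 3 raters (columns).
toy = [[5, 5, 4], [4, 5, 5], [5, 4, 4], [3, 4, 3], [5, 5, 5]]
icc_single, icc_average = icc_two_way_random(toy)
print("alpha:", round(cronbach_alpha(toy), 3))
print("ICC(2,1):", round(icc_single, 3), "ICC(2,k):", round(icc_average, 3))
```

Under this model, the average-measures coefficient is at least as large as the single-measures coefficient whenever the latter is positive, because averaging over raters reduces rater-specific error.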
Results: Most items were rated as appropriate or highly appropriate (M = 4.6–4.9). The overall reliability among raters was poor (ICC = 0.45), while single-rater reliability was higher (ICC = 0.65). Only a small number of items were flagged for revision, typically those involving emotional or contextual components. Experts noted that the English versions tended to be more detailed and more consistent with the literature, whereas certain Turkish terms required clearer cultural adaptation.
Conclusion: GPT-4 can produce clinically relevant and linguistically accurate stuttering materials when paired with expert review. However, human validation remains essential to refine affective and culture-specific elements. These findings support the integration of AI-assisted tools in multilingual clinical content development.
Keywords: Artificial intelligence; Content validity; Stuttering.
Copyright © 2025 Elsevier Inc. All rights reserved.