Background: Patients with limited English proficiency (LEP) face disproportionate risks at emergency department (ED) discharge. Professional interpretation improves outcomes, but real-time written translations remain difficult to provide in many EDs. Modern transformer-based large language models (LLMs) may offer improved translation quality compared with older systems, yet their performance on ad hoc provider-written ED discharge instructions is not well established.
Methods: We conducted a blinded, cross-sectional non-inferiority study of English-language ED discharge instructions translated into Spanish, Brazilian Portuguese, and Simplified Chinese, comparing Google Translate and ChatGPT-4o with professional medical interpreters. Fifty-three randomly selected provider-written instructions (100-500 words, with spelling and grammar errors preserved) were translated into each language by each method, yielding 477 unique translations. Professional medical interpreters, blinded to translation method, independently scored each translation for fluency, adequacy, meaning, and severity on a five-point Likert scale, and inter-rater reliability between their evaluations was calculated. A 0.5-point non-inferiority margin was pre-specified, and adjusted mean Likert ratings from mixed-effects models were compared between translation methods for each language and accuracy dimension. The proportion of clinically significant translation errors was also compared between methods, as was evaluators' ability to guess the translation method.
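As an illustration only, the non-inferiority decision described in the Methods can be sketched as a confidence-interval check against the pre-specified 0.5-point margin. This is a minimal sketch, not the authors' analysis code: the names `Estimate`, `classify`, and `n_translations` are hypothetical, the difference is assumed to be machine minus professional with higher Likert ratings meaning better translations, and the mixed-effects modeling that would produce the adjusted estimates is not shown.

```python
from dataclasses import dataclass

# Pre-specified non-inferiority margin in Likert points (from the Methods).
MARGIN = 0.5


@dataclass
class Estimate:
    """Adjusted mean rating difference with its confidence interval.

    Assumption: diff = machine translation minus professional translation,
    and higher Likert ratings indicate better translations.
    """
    diff: float
    ci_lower: float
    ci_upper: float


def classify(est: Estimate, margin: float = MARGIN) -> str:
    """Classify one language/dimension comparison against the margin."""
    if est.ci_lower > -margin:
        # The whole interval lies above -margin.
        return "non-inferior"
    if est.ci_upper < -margin:
        # The whole interval lies below -margin.
        return "inferior"
    # The interval straddles -margin.
    return "inconclusive"


def n_translations(n_instructions: int = 53, n_languages: int = 3,
                   n_methods: int = 3) -> int:
    """Each instruction is translated into each language by each method."""
    return n_instructions * n_languages * n_methods
```

With purely hypothetical inputs, `classify(Estimate(-0.2, -0.4, 0.0))` would be labeled non-inferior because the lower confidence bound stays above -0.5, and `n_translations()` reproduces the 477 translations reported (53 instructions, 3 languages, 3 methods).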
Results: Inter-rater reliability was high across languages. Both machine translation methods were non-inferior to professional interpreters for adequacy, meaning, and severity in Spanish and Portuguese, and for all four domains, including fluency, in Chinese. For fluency, Google Translate and ChatGPT-4o were inferior in Spanish and Portuguese. The frequency of clinically significant errors did not differ significantly by translation method. Evaluators, blinded to method, frequently misidentified machine translations as professional.
Conclusions: In this multi-language evaluation of real-world ED discharge instructions, Google Translate and ChatGPT-4o were non-inferior to professional interpreters for most domains of translation accuracy.
© 2026 by the Society for Academic Emergency Medicine.