Ambient AI Scribes in Clinical Practice: A Randomized Trial

NEJM AI. 2025 Dec;2(12):10.1056/aioa2501000. doi: 10.1056/aioa2501000. Epub 2025 Nov 26.

Abstract

Background: Ambient artificial intelligence (AI) scribes record patient encounters and rapidly generate visit notes, representing a promising solution to documentation burden and physician burnout. However, the scribes' impacts have not been examined in randomized clinical trials.

Methods: In this parallel three-group pragmatic randomized clinical trial, 238 outpatient physicians, representing 14 specialties, were assigned 1:1:1 via covariate-constrained randomization (balancing on time-in-note, baseline burnout score, and clinic days per week) to one of two AI scribe applications - Microsoft Dragon Ambient eXperience (DAX) Copilot or Nabla - or to a usual-care control group from November 4, 2024, to January 3, 2025. The primary outcome was the change from baseline in log-transformed writing time-in-note. Secondary end points, measured by surveys, included Mini-Z 2.0, four-item physician task load (PTL), and Professional Fulfillment Index - Work Exhaustion (PFI-WE) scores, which evaluate aspects of burnout, work environment, and stress, as well as targeted questions addressing safety, accuracy, and usability.
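The abstract does not describe how the covariate-constrained randomization was carried out. As a rough illustration of the general technique only, the sketch below draws many candidate 1:1:1 allocations, scores each for covariate imbalance across arms, keeps the best-balanced fraction, and picks one of those at random. The imbalance metric, the number of candidates, the kept fraction, and all function and parameter names are assumptions made for illustration, not details from the trial.

```python
import random
import statistics


def imbalance(allocation, covariates, n_arms=3):
    """Sum, over covariates, of the range (max - min) of standardized arm means.

    The metric is an assumption chosen for illustration; real trials use
    a variety of balance criteria.
    """
    score = 0.0
    for values in covariates:
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values) or 1.0  # avoid dividing by zero
        arm_means = []
        for arm in range(n_arms):
            members = [(v - mean) / sd
                       for v, a in zip(values, allocation) if a == arm]
            arm_means.append(statistics.fmean(members))
        score += max(arm_means) - min(arm_means)
    return score


def constrained_randomize(covariates, n_arms=3, n_candidates=2000,
                          keep_fraction=0.1, seed=0):
    """Return one allocation chosen at random from the best-balanced candidates.

    covariates: list of equal-length lists, one list per covariate.
    All defaults are hypothetical values.
    """
    rng = random.Random(seed)
    n = len(covariates[0])
    base = [i % n_arms for i in range(n)]  # equal arm sizes when n divides evenly
    candidates = []
    for _ in range(n_candidates):
        alloc = base[:]
        rng.shuffle(alloc)
        candidates.append((imbalance(alloc, covariates, n_arms), alloc))
    candidates.sort(key=lambda c: c[0])
    kept = candidates[:max(1, int(n_candidates * keep_fraction))]
    return rng.choice(kept)[1]
```

Choosing at random among the acceptable allocations, rather than always taking the single best one, keeps the assignment unpredictable while still constraining imbalance.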

Results: DAX was used in 33.5% of 24,696 visits; Nabla was used in 29.5% of 23,653 visits. Nabla users experienced a 9.5% (95% confidence interval [CI], -17.2% to -1.8%; P=0.02) decrease in time-in-note versus the control group, whereas DAX users exhibited no significant change versus the control group (-1.7%; 95% CI, -9.4% to +5.9%; P=0.66). Increases in total Mini-Z (scale 10-50; DAX +2.83 [95% CI, +1.28 to +4.37]; Nabla +2.69 [95% CI, +1.14 to +4.23]) and reductions in PTL (scale 0-400; DAX -39.9 [95% CI, -71.9 to -7.9]; Nabla -31.7 [95% CI, -63.8 to +0.4]) and PFI-WE (scale 0-4; DAX -0.32 [95% CI, -0.55 to -0.08]; Nabla -0.23 [95% CI, -0.46 to +0.01]) scores suggest improvement for users of either scribe versus the control. One grade 1 (mild) adverse event was reported, while clinically significant inaccuracies were noted "occasionally" on five-point Likert questions (DAX 2.7 [95% CI, 2.4 to 3.0]; Nabla 2.8 [95% CI, 2.6 to 3.0]).
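Because the primary outcome is analyzed on the log scale, a between-group difference in log time-in-note corresponds to a ratio of times. The abstract does not state how its percentages were computed, so the sketch below only illustrates the standard back-transformation from a natural-log difference to a percent change. The function name and the input value are hypothetical and are not taken from the trial.

```python
import math


def log_diff_to_percent(log_diff):
    """Convert a difference in natural-log values to a percent change.

    A difference d on the log scale means the ratio of the two
    quantities is exp(d), so the percent change is (exp(d) - 1) * 100.
    """
    return (math.exp(log_diff) - 1.0) * 100.0


# Hypothetical example: a log-scale difference of -0.10
example = log_diff_to_percent(-0.10)
print(f"{example:.2f}% change")
```

For small differences the percent change is close to 100 times the log difference, and the two diverge as the difference grows.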

Conclusions: Nabla reduced time-in-note versus the control. Both DAX and Nabla resulted in potential improvements in burnout, task load, and work exhaustion, but these secondary end point findings need confirmation in larger, multicenter trials. Clinicians reported that performance was similar across the two distinct platforms, and occasional inaccuracies observed in either scribe require ongoing vigilance. (Funded by the University of California, Los Angeles, Department of Medicine and others; ClinicalTrials.gov number, NCT06792890.)

Associated data

  • ClinicalTrials.gov/NCT06792890