Background
Large language models (LLMs) offer potential to reduce administrative burden in clinical care by generating discharge summaries. Most prior evaluations have been limited to drafts, small cohorts, or nonintegrated settings. Robust validation of fully automated, EHR-integrated systems in real-world practice is lacking.

Methods
This study was conducted in April 2025 at a Dutch academic hospital. A total of 292 paired discharge summaries from multiple departments were evaluated, each consisting of a physician-written and an LLM-generated version. Summaries were independently assessed by eight blinded clinicians using a 5-point Likert scale across completeness, correctness, and conciseness. Trustworthiness was also scored. Domain and total scores were compared with Wilcoxon signed-rank tests, and interrater reliability was quantified using Gwet's AC2.

Findings
LLM-generated summaries had lower completeness (4.50 (4.00-5.00) vs 5.00 (4.50-5.00); p < 0.001), similar correctness (5.00 (4.50-5.00) vs 5.00 (4.63-5.00); p = 0.14), and greater conciseness (5.00 (4.50-5.00) vs 4.50 (4.00-5.00); p < 0.001) compared with physician-written summaries. Total scores did not differ (14.00 (13.00-15.00) vs 14.00 (13.00-15.00); p = 0.34). Physician-written summaries were trusted by both reviewers in 279 (95.5%) cases, whereas LLM-generated summaries were trusted in 249 (85.3%) cases, partially trusted in 34 (11.6%), and rejected in 9 (3.1%). Interrater agreement for total scores was high (AC2 0.87, 95% CI 0.83-0.90 for LLM; 0.85, 95% CI 0.81-0.89 for physician).

Interpretation
Discharge summaries generated by an EHR-integrated LLM achieved quality ratings comparable to physician-written documents across multiple specialties, with no difference in total scores. Unlike earlier pilot work, this study demonstrates real-world feasibility of automated LLM use in clinical workflows at scale. With appropriate oversight and specialty-specific refinement, such systems could substantially reduce documentation burden while maintaining discharge summary quality.
Mehri et al. (Thu,) studied this question.
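
The Methods state that paired domain and total scores were compared with Wilcoxon signed-rank tests. The abstract does not say which software or test variant was used, so the following is only a minimal sketch of one common variant: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided p-value comes from a tie-corrected normal approximation without continuity correction. The function name `wilcoxon_signed_rank` and the example scores are hypothetical and are not data from the study.

```python
from math import erf, sqrt


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples (illustrative sketch).

    Assumptions (not taken from the study): zero differences are discarded,
    ties share average ranks, and the p-value uses a tie-corrected normal
    approximation without continuity correction.
    Returns (W, p) where W is the smaller of the two signed rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # every pair is identical: no evidence of a difference

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean) / sqrt(var)  # z <= 0 because w is the smaller rank sum
    p = 2 * 0.5 * (1 + erf(z / sqrt(2)))  # twice the lower-tail normal probability
    return w, min(p, 1.0)


# Hypothetical paired scores for six summaries (LLM vs physician), for illustration only.
llm_scores = [4.5, 5.0, 4.0, 4.5, 5.0, 3.5]
physician_scores = [5.0, 5.0, 4.5, 5.0, 4.5, 4.5]
statistic, p_value = wilcoxon_signed_rank(llm_scores, physician_scores)
print(f"W = {statistic}, p = {p_value:.3f}")
```

With a sample this small an exact test would normally be preferred over the normal approximation. At the study's scale of 292 pairs the approximation is a more reasonable choice, although the abstract does not say which was used.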