What question did this study set out to answer?

This study aims to assess the quality of discharge summaries generated by large language models compared to those written by physicians in a clinical setting.

April 12, 2026 | Open Access

Assessing the quality of AI-generated and physician-written discharge summaries: evaluation of an EHR-integrated tool in a Dutch academic hospital

Key Points

Evaluated 292 paired discharge summaries from various departments
Summaries rated by eight blinded clinicians using a 5-point Likert scale
Assessed completeness, correctness, and conciseness
Analyzed scores using Wilcoxon signed-rank tests
Measured interrater reliability with Gwet's AC2
LLM-generated summaries scored lower in completeness but higher in conciseness than physician-written summaries
Total scores were similar for both LLM-generated and physician-written summaries
High interrater reliability was observed for both summary types
Significantly more physician-written summaries were trusted by reviewers compared to LLM-generated summaries

Abstract

Background: Large language models (LLMs) offer potential to reduce administrative burden in clinical care by generating discharge summaries. Most prior evaluations have been limited to drafts, small cohorts, or nonintegrated settings. Robust validation of fully automated, EHR-integrated systems in real-world practice is lacking.

Methods: This study was conducted in April 2025 at a Dutch academic hospital. A total of 292 paired discharge summaries from multiple departments were evaluated, each consisting of a physician-written and an LLM-generated version. Summaries were independently assessed by eight blinded clinicians using a 5-point Likert scale across completeness, correctness, and conciseness. Trustworthiness was also scored. Domain and total scores were compared with Wilcoxon signed-rank tests, and interrater reliability was quantified using Gwet's AC2.

Findings: LLM-generated summaries had lower completeness (4.50 (4.00-5.00) vs 5.00 (4.50-5.00); p < 0.001), similar correctness (5.00 (4.50-5.00) vs 5.00 (4.63-5.00); p = 0.14), and greater conciseness (5.00 (4.50-5.00) vs 4.50 (4.00-5.00); p < 0.001) compared with physician-written summaries. Total scores did not differ (14.00 (13.00-15.00) vs 14.00 (13.00-15.00); p = 0.34). Physician-written summaries were trusted by both reviewers in 279 (95.5%) cases, whereas LLM-generated summaries were trusted in 249 (85.3%) cases, partially trusted in 34 (11.6%), and rejected in 9 (3.1%). Interrater agreement for total scores was high (AC2 0.87, 95% CI 0.83-0.90 for LLM; 0.85, 95% CI 0.81-0.89 for physician).

Interpretation: Discharge summaries generated by an EHR-integrated LLM achieved quality ratings comparable to physician-written documents across multiple specialties, with no difference in total scores. Unlike earlier pilot work, this study demonstrates real-world feasibility of automated LLM use in clinical workflows at scale. With appropriate oversight and specialty-specific refinement, such systems could substantially reduce documentation burden while maintaining discharge summary quality.
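The trust percentages reported in the findings can be re-derived from the raw counts. The following is a minimal sketch of that arithmetic, not the authors' analysis code; the variable and function names (`TOTAL_PAIRS`, `llm_trust`, `pct`) are illustrative assumptions, and only the counts themselves come from the abstract.

```python
# Counts taken from the abstract; all names below are illustrative, not from the paper.
TOTAL_PAIRS = 292

# Reviewer trust outcomes for LLM-generated summaries.
llm_trust = {"trusted": 249, "partially trusted": 34, "rejected": 9}

# Physician-written summaries trusted by both reviewers.
physician_trusted = 279


def pct(count, total=TOTAL_PAIRS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# The three LLM outcome categories should account for every paired summary.
assert sum(llm_trust.values()) == TOTAL_PAIRS

llm_pct = {label: pct(n) for label, n in llm_trust.items()}
physician_pct = pct(physician_trusted)

# Absolute difference in the number of fully trusted summaries.
trusted_gap = physician_trusted - llm_trust["trusted"]

print(llm_pct)        # {'trusted': 85.3, 'partially trusted': 11.6, 'rejected': 3.1}
print(physician_pct)  # 95.5
print(trusted_gap)    # 30
```

The three LLM categories sum to all 292 pairs, and the gap between the two groups is 30 fully trusted summaries, or about 10 percentage points.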

Cite This Study

Mehri et al. studied this question.

synapsesocial.com/papers/69db36a04fe01fead37c4a04 https://doi.org/10.1016/j.ebiom.2026.106247
