June 17, 2026

Augmenting clinical decision-making with large language models: evaluation across general and specialty tasks

Key Points

This study aims to assess the role of large language models in supporting clinical decision-making across various specialties and experience levels.
Evaluated 3 LLMs (Deepseek-R1, GPT-4o-mini, LLaMA-4) on general-disease and prostate-disease tasks.
Clinicians of differing experience levels completed tasks before and after reviewing LLM responses.
Performance was rated on a 5-point Likert scale and analyzed using the Wilcoxon signed-rank test.
LLM assistance significantly improved clinician performance in general-disease tasks (P < .05).
Junior clinicians' scores increased by 15.9%-20.8% in specialty scenarios, surpassing senior clinicians' unaided performance (P < .05).
Deepseek-R1 outperformed others in diagnostic tasks.

Abstract

OBJECTIVE: Large language models (LLMs) have shown promise in clinical applications, yet prior studies mainly evaluated their standalone performance on benchmarks or examinations. This study assesses how LLMs support clinicians across specialties, disease contexts, experience levels, and decision-making stages.

METHODS: We evaluated 3 LLMs (Deepseek-R1, GPT-4o-mini, and LLaMA-4) using 2 task types: (1) general-disease tasks spanning specialties and disease incidence levels, and (2) a prostate disease scenario covering the clinical workflow, including diagnosis, treatment planning, postoperative rehabilitation, and prognosis. Clinicians of differing seniority completed the tasks independently and then repeated them after reviewing LLM-generated responses. Experts rated performance on a 5-point Likert scale, and differences were analyzed with the Wilcoxon signed-rank test.

RESULTS: LLM assistance significantly improved clinician performance across general-disease tasks (P < .05). Performance varied among models, with Deepseek-R1 performing best in diagnostic tasks.

DISCUSSION: These findings suggest that LLM assistance may provide greater benefit to less-experienced clinicians and is particularly effective in downstream decision-making stages, indicating a potential role in mitigating experience-related performance disparities.

CONCLUSIONS: LLMs may serve as clinical decision support tools across diseases, specialties, clinician experience levels, and workflow stages, with observed improvements in decision quality and reduced performance disparities.
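For readers unfamiliar with the paired analysis named in the methods, the following is a minimal sketch of a Wilcoxon signed-rank test applied to before/after Likert ratings. It is an illustration only, not the authors' analysis code: the abstract does not say which software was used, whether an exact or asymptotic p-value was computed, or how zero differences and ties were handled. The function name `wilcoxon_signed_rank`, the choice to drop zero differences, the use of average ranks for ties, and all scores shown are assumptions made for this example.

```python
from itertools import groupby


def wilcoxon_signed_rank(before, after):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Assumptions (not taken from the study): zero differences are dropped
    and tied absolute differences share their average rank.
    Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order. Ranks are stored doubled so that
    # half-integer average ranks remain exact integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    pos = 0
    for _, group in groupby(order, key=lambda i: abs(diffs[i])):
        idx = list(group)
        avg2 = 2 * pos + len(idx) + 1  # twice the average rank of this tie group
        for i in idx:
            ranks2[i] = avg2
        pos += len(idx)

    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    w_minus2 = sum(ranks2) - w_plus2

    # Exact null distribution: under H0 each rank carries a + or - sign
    # with probability 1/2. Count the sign assignments for each rank sum.
    counts = {0: 1}
    for r in ranks2:
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c          # rank r gets a minus sign
            nxt[s + r] = nxt.get(s + r, 0) + c  # rank r gets a plus sign
        counts = nxt

    t2 = min(w_plus2, w_minus2)
    tail = sum(c for s, c in counts.items() if s <= t2)
    p_value = min(1.0, 2 * tail / 2 ** n)
    return w_plus2 / 2, w_minus2 / 2, p_value


# Hypothetical 5-point Likert ratings for 8 tasks (not data from the study).
unaided = [3, 2, 4, 3, 2, 3, 4, 2]
assisted = [4, 4, 4, 5, 3, 4, 3, 4]

w_plus, w_minus, p = wilcoxon_signed_rank(unaided, assisted)
print(w_plus, w_minus, p)  # 25.5 2.5 0.078125
```

The test is a natural fit for this design because each clinician is scored on the same task twice and Likert ratings are ordinal, so ranking the paired differences avoids assuming normally distributed scores. For larger samples, an established statistics library would be a more practical choice than this enumeration.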
