OBJECTIVE: Large language models (LLMs) have shown promise in clinical applications, yet prior studies mainly evaluated their standalone performance on benchmarks or examinations. This study assesses how LLMs support clinicians across specialties, disease contexts, experience levels, and decision-making stages. METHODS: We evaluated 3 LLMs (Deepseek-R1, GPT-4o-mini, and LLaMA-4) on 2 task types: (1) general-disease tasks spanning specialties and disease incidence levels, and (2) a prostate disease scenario covering the clinical workflow, including diagnosis, treatment planning, postoperative rehabilitation, and prognosis. Clinicians of differing seniority completed the tasks independently and then repeated them after reviewing LLM-generated responses. Experts rated performance on a 5-point Likert scale, and differences were analyzed with the Wilcoxon signed-rank test. RESULTS: LLM assistance significantly improved clinician performance across general disease tasks (P .05). Performance varied among models, with Deepseek-R1 performing best in diagnostic tasks. DISCUSSION: These findings suggest that LLM assistance may provide greater benefit to less-experienced clinicians and is particularly effective in downstream decision-making stages, indicating a potential role in mitigating experience-related performance disparities. CONCLUSIONS: LLMs may serve as clinical decision support tools across diseases, specialties, clinician experience levels, and workflow stages, with observed improvements in decision quality and reduced performance disparities.
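The paired design described in METHODS (each clinician rated before and after reviewing LLM output) is the setting the Wilcoxon signed-rank test is built for. The following is a minimal sketch of that test in plain Python, not the authors' analysis code: the abstract does not say which software was used or whether exact or approximate P values were computed, so the normal approximation (without continuity correction) is an assumption here, and the function name and all ratings are hypothetical.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions i..j hold 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation with the standard tie correction to the variance.
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p_value = min(1.0, 2 * NormalDist().cdf(z))
    return w_plus, w_minus, p_value


# Hypothetical 5-point Likert ratings for ten clinicians (not study data).
unassisted = [2, 3, 3, 2, 4, 3, 2, 3, 3, 2]
assisted = [3, 4, 3, 4, 4, 4, 3, 5, 4, 3]
w_plus, w_minus, p_value = wilcoxon_signed_rank(unassisted, assisted)
print(f"W+ = {w_plus}, W- = {w_minus}, P = {p_value:.4f}")
```

With Likert data, ties and zero differences are common, which is why the sketch drops zeros and applies a tie correction; for small samples an exact test may be preferable to the approximation used here.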
Tao et al. studied this question.