What question did this study set out to answer?

This study aimed to replicate, among UK physicians, earlier findings on how access to an LLM affects diagnostic reasoning, and to analyze interaction patterns that might explain the performance gap between LLM-assisted physicians and the LLM alone.

May 1, 2026

Human-AI collaboration in clinical reasoning: a UK replication and interaction analysis.

Key Points

Aimed to replicate, in a UK setting, the Goh et al. finding that an LLM working alone outperformed clinicians assisted by the same LLM in diagnostic reasoning tests, and to examine whether interaction patterns explain the gap.
Within-subjects study in which UK physicians (N=22) answered structured questions on four clinical vignettes.
Participants had access to an LLM for two of the four cases via a custom web application; results were analyzed using a mixed-effects model accounting for case difficulty and baseline variability among clinicians.
Qualitative analysis coded participant-LLM interaction logs and evaluated the rate of LLM use per question.
Physicians with LLM assistance scored significantly lower than the LLM alone (mean difference 21.3 percentage points, p<0.001).
Access to the LLM was associated with better physician performance than conventional resources (74.3% vs. 65.7%, p=0.001).
Only 30% of case questions were posed directly to the LLM, suggesting that under-utilization contributed to the performance gap.

Abstract

OBJECTIVES: A paper from Goh et al. found that a large language model (LLM) working alone outperformed American clinicians assisted by the same LLM in diagnostic reasoning tests (Goh E, Gallo R, Hom J, Strong E, Weng J, Kerman H, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open 2024;7:e2440969). We aimed to replicate this experiment in a UK setting and explore how interactions with the LLM might explain the observed gaps in performance.

METHODS: This was a within-subjects study of UK physicians. 22 participants answered structured questions on four clinical vignettes. For 2 cases physicians had access to an LLM via a custom-built web-application. Results were analysed using a mixed-effects model accounting for case difficulty and the variability of clinicians at baseline. Qualitative analysis involved coding of participant-LLM interaction logs and evaluating the rates of LLM use per question.

RESULTS: Physicians with LLM assistance scored significantly lower than the LLM alone (mean difference 21.3 percentage points, p<0.001). Access to the LLM was associated with improved physician performance compared to using conventional resources (74.3 vs. 65.7 %, p=0.001). There was significant heterogeneity in the degree of LLM-assisted improvement (SD 12.8 %). Qualitative analysis revealed that only 30 % of case questions were directly posed to the LLM, which suggests that under-utilisation of the LLM contributed to the observed performance gap.

CONCLUSIONS: While access to an LLM can improve diagnostic accuracy, realising the full potential of human-AI collaboration may require a focus on training clinicians to integrate these tools into their cognitive workflows and on designing systems that make these integrations the default rather than an optional extra.
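
To illustrate what a within-subjects comparison and the reported heterogeneity in improvement mean in practice, the following is a minimal sketch in Python. It is not the authors' analysis: the study used a mixed-effects model, whereas this sketch uses a simpler per-physician paired difference. The physician names, the scores, and the function names `paired_improvements` and `summarize` are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

# Hypothetical per-case scores (percent) for three made-up physicians.
# These numbers are illustrative only and are NOT data from the study.
scores = {
    "physician_a": {"with_llm": [80.0, 70.0], "without_llm": [65.0, 60.0]},
    "physician_b": {"with_llm": [72.0, 78.0], "without_llm": [70.0, 74.0]},
    "physician_c": {"with_llm": [60.0, 66.0], "without_llm": [68.0, 64.0]},
}


def paired_improvements(scores):
    """Return each physician's mean score with the LLM minus without it."""
    return {
        name: mean(s["with_llm"]) - mean(s["without_llm"])
        for name, s in scores.items()
    }


def summarize(improvements):
    """Return the mean and sample SD of the per-physician improvements."""
    values = list(improvements.values())
    return mean(values), stdev(values)


improvements = paired_improvements(scores)
avg, sd = summarize(improvements)
print(f"mean improvement: {avg:.1f} pp, SD: {sd:.1f} pp")
```

Because each physician serves as their own control, the spread of these per-physician differences gives a rough picture of how unevenly the benefit of LLM access is distributed. A mixed-effects model, as used in the study, would additionally account for case difficulty rather than averaging over it.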

Cite This Study

Healy et al. studied this question.

synapsesocial.com/papers/69f44488967e944ac556774c https://doi.org/10.1515/dx-2025-0176