What question did this study set out to answer?

This research aims to explore how large language models (LLMs) affect medical students' diagnostic performance in rheumatology compared to traditional resources.

March 26, 2026 · Open Access

Large language models enhance diagnostic reasoning of medical students in rheumatology: a randomized controlled trial

Key Points

The study examined how large language models (LLMs) affect medical students' diagnostic performance in rheumatology compared with traditional resources.
The design was a randomized controlled trial in which medical students solved three rheumatology vignettes.
Participants were randomized to two groups: one used an LLM (ChatGPT-4o) alongside traditional resources, and the other used traditional resources alone.
The primary outcome was the proportion of correct top diagnoses; secondary outcomes were correctness within the top 5 diagnoses, a cumulative diagnostic score, diagnostic confidence, and completion time.
The LLM group identified the correct top diagnosis more often than controls (77.5% vs. 32.4%).
Cumulative diagnostic scores were significantly higher in the LLM group (mean 12.3) than in the control group (mean 6.7).
LLM users reported greater diagnostic confidence than students using traditional resources alone (mean 7.0 vs. 6.1).
Completing the cases took longer in the LLM group, averaging 505 seconds compared with 287 seconds for controls.

Abstract

Diagnostic errors and delays are common in rheumatology, driven by overlapping symptoms and the rarity of many diseases. While traditional diagnostic decision support systems (DDSS) have seen limited adoption because of high input burden and low perceived value, large language models (LLMs) now offer genuine dialogue and reduced effort, with rapidly improving diagnostic performance, yet empirical evidence on their real-world effectiveness and educational impact is still scarce. The aim of this study was to investigate the impact of an LLM on medical students’ diagnostic performance in rheumatology compared with traditional resources. In this randomized controlled trial, medical students solved three rheumatology vignettes. For each case, they provided a main diagnosis with confidence and up to four differential diagnoses. Participants were randomized to use ChatGPT-4o plus traditional resources or traditional resources alone. The primary outcome was the proportion of correct top diagnoses. Secondary outcomes were correctness within the top 5 diagnoses, a cumulative diagnostic score, diagnostic confidence, and completion time. Sixty-eight students (mean ± SD age 24.8 ± 2.6 years) were randomized. The LLM group identified the correct top diagnosis more often than controls (77.5% vs. 32.4%), yielding an adjusted odds ratio of 7.0 (95% CI 3.8–14.4; P<.001), and also exceeded LLM-only performance (77.5% vs. 71.6%). Cumulative diagnostic scores were higher with LLM support (mean ± SD 12.3 ± 2.3 vs. 6.7 ± 3.2; P<.001), as was confidence (7.0 ± 1.3 vs. 6.1 ± 1.2; P<.001). Completion time increased in the LLM group (505 ± 131 s vs. 287 ± 106 s; P<.001). Medical students using an LLM achieved significantly higher diagnostic accuracy than those using conventional resources. Students assisted by the LLM also outperformed the model alone, highlighting the potential of human-AI collaboration. These findings suggest that LLMs may help improve clinical reasoning in complex fields such as rheumatology. However, these findings should be interpreted cautiously, as larger and more diverse studies are needed to confirm their generalisability.

ClinicalTrials.gov NCT06748170, registered 27 December 2024.
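
As a rough sanity check on the effect size, the reported proportions of correct top diagnoses (77.5% vs. 32.4%) can be turned into a crude, unadjusted odds ratio. The sketch below is illustrative only: the function names are hypothetical, and the calculation ignores whatever adjustment the authors applied and the fact that each student solved three vignettes, so it is not expected to reproduce the adjusted odds ratio of 7.0 exactly.

```python
def odds(p: float) -> float:
    """Convert a proportion (0 < p < 1) into odds."""
    return p / (1.0 - p)


def crude_odds_ratio(p_group: float, p_reference: float) -> float:
    """Unadjusted odds ratio of one proportion against a reference proportion."""
    return odds(p_group) / odds(p_reference)


# Proportions of correct top diagnoses as reported in the abstract.
p_llm = 0.775
p_control = 0.324

or_crude = crude_odds_ratio(p_llm, p_control)
risk_diff = p_llm - p_control  # absolute difference in proportions

print(f"crude odds ratio: {or_crude:.2f}")        # about 7.19
print(f"absolute difference: {risk_diff:.1%}")    # 45.1%
```

The crude value of roughly 7.2 sits close to the reported adjusted estimate and well inside its 95% confidence interval (3.8–14.4), which is what one would expect if the adjustment changed the estimate only modestly.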
