Diagnostic errors and delays are common in rheumatology, driven by overlapping symptoms and the rarity of many diseases. Traditional diagnostic decision support systems (DDSS) have seen limited adoption because of high input burden and low perceived value. Large language models (LLMs) now offer genuine dialogue and reduced effort, with rapidly improving diagnostic performance, yet empirical evidence on their real-world effectiveness and educational impact is still scarce. This study aimed to investigate the impact of an LLM on medical students’ diagnostic performance in rheumatology compared with traditional resources. In this randomized controlled trial, medical students solved three rheumatology vignettes. For each case, they provided a main diagnosis with a confidence rating and up to four differential diagnoses. Participants were randomized to use ChatGPT-4o plus traditional resources or traditional resources alone. The primary outcome was the proportion of correct top diagnoses. Secondary outcomes were correctness within the top 5 diagnoses, a cumulative diagnostic score, diagnostic confidence, and completion time. Sixty-eight students (mean ± SD age 24.8 ± 2.6 years) were randomized. The LLM group identified the correct top diagnosis more often than controls (77.5% vs. 32.4%), yielding an adjusted odds ratio of 7.0 (95% CI 3.8–14.4; P<.001), and also exceeded the performance of the LLM alone (77.5% vs. 71.6%). Cumulative diagnostic scores were higher with LLM support (mean ± SD 12.3 ± 2.3 vs. 6.7 ± 3.2; P<.001), as was confidence (7.0 ± 1.3 vs. 6.1 ± 1.2; P<.001). Completion time increased in the LLM group (505 ± 131 s vs. 287 ± 106 s; P<.001). Medical students using an LLM achieved significantly higher diagnostic accuracy than those using conventional resources. Students assisted by the LLM also outperformed the model alone, highlighting the potential of human-AI collaboration.
These findings suggest that LLMs may help improve clinical reasoning in complex fields such as rheumatology. However, they should be interpreted cautiously, as larger and more diverse studies are needed to confirm their generalisability. Trial registration: ClinicalTrials.gov NCT06748170, registered 27 December 2024.
Roemer et al. studied this question.
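As a rough sanity check on the reported effect size, the crude (unadjusted) odds ratio can be recomputed from the two proportions of correct top diagnoses given in the abstract. The sketch below is illustrative only: the function names are hypothetical, and because the abstract does not describe the adjustment model, the crude value is not expected to match the adjusted odds ratio of 7.0 exactly.

```python
def odds(p: float) -> float:
    """Convert a proportion strictly between 0 and 1 into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must be strictly between 0 and 1")
    return p / (1.0 - p)


def crude_odds_ratio(p_intervention: float, p_control: float) -> float:
    """Unadjusted odds ratio of the intervention group versus the control group."""
    return odds(p_intervention) / odds(p_control)


# Proportions of correct top diagnoses as reported in the abstract.
p_llm = 0.775      # LLM plus traditional resources
p_control = 0.324  # traditional resources alone

crude_or = crude_odds_ratio(p_llm, p_control)
print(f"crude odds ratio: {crude_or:.2f}")
```

The crude value comes out at roughly 7.2, close to the adjusted estimate and well inside the reported 95% CI of 3.8–14.4.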