The open-source reasoning large language model DeepSeek-R1 is increasingly being used in hospitals, but its variants of different parameter sizes, especially the distilled models, have not been fully evaluated for diagnostic performance. To address this, paired comparisons were conducted between five DeepSeek-R1 models and their respective base models. The models were tested on a diagnostic dataset of 110 simulated clinical cases from open-access data, covering internal medicine, surgery, neurology, gynecology, and pediatrics, and categorized by incidence (frequent, less frequent, rare). Each model was asked to generate five preliminary diagnoses from the clinical symptoms, and a response was considered correct if the correct diagnosis appeared among those five. The model pairings were DeepSeek-R1-8B vs. Llama3.1-8B, DeepSeek-R1-14B vs. Qwen2.5-14B, DeepSeek-R1-32B vs. Qwen2.5-32B, DeepSeek-R1-70B vs. Llama3.3-70B, and DeepSeek-R1-671B vs. DeepSeek-V3. All reasoning models except DeepSeek-R1-671B were distilled versions. Differences in diagnostic accuracy within each pair were assessed using McNemar's test for discordant pairs, with a significance threshold of 0.01. The results showed that DeepSeek-R1-671B significantly outperformed DeepSeek-V3 (95.45% vs. 88.18%; p = 0.008), while DeepSeek-R1-8B underperformed relative to Llama3.1-8B (47.27% vs. 64.54%; p = 0.003). No significant differences were observed for the mid-sized models (14B, 32B, and 70B). Subgroup analyses by incidence and clinical specialty further supported these conclusions. Qualitative analysis of the chain-of-thought outputs in incorrect cases revealed three error modes prevalent across all distilled models: reasoning drift, red-flag recognition failure, and diagnostic priority inversion. The study concludes that DeepSeek-R1-671B shows potential for medical diagnosis, but the distilled models do not exceed their base models.
Based on simulated clinical cases, our results do not support deploying distilled models for text-based diagnostic tasks without further validation on real patient data.
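The paired comparison described above can be illustrated with a minimal Python sketch. It assumes an exact (binomial) form of McNemar's test and a simple case-insensitive string match for top-five correctness; the abstract does not say which variant of the test or which matching procedure the authors used. The function names (`top5_correct`, `discordant_counts`, `mcnemar_exact`) and the example counts are hypothetical. In particular, the split of 8 vs. 0 discordant cases is chosen only so that the net difference equals the 8-case gap implied by the reported accuracies of 95.45% and 88.18% on 110 cases; the actual discordant counts are not given in the text.

```python
from math import comb


def top5_correct(candidates, truth):
    """True if the reference diagnosis is among the first five candidates.

    Exact, case-insensitive matching is an assumption made for illustration.
    """
    norm = truth.strip().lower()
    return any(c.strip().lower() == norm for c in candidates[:5])


def discordant_counts(reasoning_hits, base_hits):
    """Count cases where exactly one model of a pair was correct.

    b: reasoning model correct, base model wrong
    c: base model correct, reasoning model wrong
    """
    pairs = list(zip(reasoning_hits, base_hits))
    b = sum(1 for r, s in pairs if r and not s)
    c = sum(1 for r, s in pairs if s and not r)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis, each discordant case is equally likely to
    favor either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-case results for one model pair on 110 cases:
# 97 cases both correct, 8 only the reasoning model, 5 both wrong.
reasoning_hits = [True] * 97 + [True] * 8 + [False] * 5
base_hits = [True] * 97 + [False] * 8 + [False] * 5

b, c = discordant_counts(reasoning_hits, base_hits)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)  # 8 0 0.0078125
```

Only the discordant cases enter the test, so two models with very different accuracies on paper can still show no significant difference if few cases separate them. With the 0.01 threshold stated above, a pair counts as significantly different only when `p_value < 0.01`.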
Zhong et al. studied this question.