What question did this study set out to answer?

This study evaluates the diagnostic performance of the open-source reasoning model DeepSeek-R1 and its distilled versions against their respective base models.

May 3, 2026

Open-Source Large Language Models Distilled DeepSeek-R1 Pose Challenges for On-Premises Clinical Deployment in Medical Diagnosis: A Comparative Study of Performance.

Key Points

Conducted paired comparisons of five DeepSeek-R1 models against their base models on a dataset of 110 simulated clinical cases.
Used McNemar's test to assess diagnostic accuracy with a significance threshold of 0.01.
Analyzed errors in model outputs to identify common modes of failure in distilled models.
DeepSeek-R1-671B outperformed DeepSeek-V3 (95.45% vs. 88.18%; p = 0.008).
DeepSeek-R1-8B underperformed compared to Llama3.1-8B (47.27% vs. 64.54%; p = 0.003).
No significant differences for mid-sized models; qualitative analysis identified reasoning drift and other error modes in distilled models.
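
The paired comparison described above can be sketched as an exact (binomial) McNemar's test on discordant pairs. This is a minimal illustration, not the study's own code: the source does not say whether the exact or chi-squared variant was used, and the function names and the counts in the usage example are hypothetical.

```python
from math import comb


def discordant_counts(reasoning_correct, base_correct):
    """Count cases where exactly one model of a pair was correct.

    Both arguments are equal-length sequences of booleans, one entry per case.
    Returns (b, c): b = only the reasoning model correct, c = only the base model correct.
    """
    pairs = list(zip(reasoning_correct, base_correct))
    b = sum(1 for r, s in pairs if r and not s)
    c = sum(1 for r, s in pairs if s and not r)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c.

    Under the null hypothesis, each discordant case favours either model
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, NOT taken from the study.
ALPHA = 0.01  # significance threshold stated in the source
p_value = mcnemar_exact(10, 2)
significant = p_value < ALPHA
```

Only cases where the two models disagree enter the test, which is why two models with similar overall accuracy can still show no significant difference on 110 cases.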

Abstract

The open-source reasoning large language model DeepSeek-R1 is increasingly being used in hospitals, but its multiple parameter-size versions, especially the distilled models, have not been fully evaluated for diagnostic performance. To address this, paired comparisons were conducted between five DeepSeek-R1 models and their respective base models. The models were tested on a diagnostic dataset of 110 simulated clinical cases drawn from open-access data, covering internal medicine, surgery, neurology, gynecology, and pediatrics, and categorized by incidence (frequent, less frequent, rare). Each model was asked to generate five preliminary diagnoses from the clinical symptoms, and a response was considered correct if the accurate diagnosis appeared among the five. The model pairings were DeepSeek-R1-8B vs. Llama3.1-8B, DeepSeek-R1-14B vs. Qwen2.5-14B, DeepSeek-R1-32B vs. Qwen2.5-32B, DeepSeek-R1-70B vs. Llama3.3-70B, and DeepSeek-R1-671B vs. DeepSeek-V3. All reasoning models except DeepSeek-R1-671B were distilled versions. Diagnostic accuracy was compared using McNemar's test for discordant pairs, with a significance threshold of 0.01. DeepSeek-R1-671B significantly outperformed DeepSeek-V3 (95.45% vs. 88.18%; p = 0.008), while DeepSeek-R1-8B underperformed relative to Llama3.1-8B (47.27% vs. 64.54%; p = 0.003). No significant differences were observed for the mid-sized models. Subgroup analyses by incidence and clinical specialty further supported these conclusions. Qualitative analysis of the chain-of-thought outputs in incorrect cases revealed three error modes prevalent across all distilled models: reasoning drift, red-flag recognition failure, and diagnostic priority inversion. The study concludes that DeepSeek-R1-671B shows potential for medical diagnosis, but the distilled models do not exceed their base models.
Based on simulated clinical cases, our results do not support deploying distilled models for text-based diagnostic tasks without further validation on real patient data.
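
The scoring rule in the abstract (a response counts as correct if the accurate diagnosis is among the five generated) can be sketched as follows. This is an illustrative assumption only: the source does not describe how matches were judged, so the case-insensitive exact string match, the function names, and the example diagnoses below are all hypothetical.

```python
def top5_hit(candidates, reference):
    """Return True if the reference diagnosis is among the first five candidates.

    Assumption: matching is a case-insensitive exact string comparison.
    The study may have used clinician review or another matching method.
    """
    target = reference.strip().lower()
    return any(c.strip().lower() == target for c in candidates[:5])


def accuracy_percent(hits):
    """Percentage of cases scored correct, rounded to two decimals."""
    return round(100 * sum(hits) / len(hits), 2)


# Hypothetical case, not from the study's dataset.
hit = top5_hit(
    ["Asthma", "Pneumonia", "Bronchitis", "Pulmonary embolism", "Heart failure"],
    "pneumonia",
)

# On 110 cases, 105 correct responses would correspond to the reported 95.45%.
full_model_accuracy = accuracy_percent([True] * 105 + [False] * 5)
```

Under this rule a model gets credit even when the correct diagnosis is ranked fifth, so the reported accuracies say nothing about how highly the correct answer was ranked.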
