May 8, 2026 | Open Access

Rational calculation and embodied practice: the irreplaceability of human experts

Key Points

The aim was to evaluate and compare the diagnostic accuracy of DeepSeek-R1 and GPT-4 relative to human experts.
Analyzed 100 clinicopathological cases from The New England Journal of Medicine.
Compared diagnostic matching rates and quality scores between DeepSeek-R1 and GPT-4.
Assessed inter-rater consistency among human clinicians.
Diagnostic matching rates were similar for DeepSeek-R1 (35%) and GPT-4 (39%).
DeepSeek-R1 had a lower correct diagnostic inclusion rate (48%) than GPT-4 (64%).
Inter-rater consistency among human clinicians was moderate (κ = 0.565).
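
To make the κ value above concrete, the following is a minimal sketch of how a chance-corrected agreement statistic can be computed. It assumes Cohen's kappa for two raters; the source does not state which kappa variant was used, and the ratings below are hypothetical toy data, not data from the study. The function name `cohen_kappa` is illustrative.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: two raters and nominal labels. The study may have used
    a different kappa variant (for example, one for more than two raters).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = diagnosis judged correct, 0 = judged incorrect).
rater_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]

example_kappa = cohen_kappa(rater_a, rater_b)
print(round(example_kappa, 3))  # prints 0.4
```

In the toy data the raters agree on 7 of 10 items, but half of that agreement is expected by chance, so κ is only 0.4. Under the commonly used Landis and Koch bands, values from 0.41 to 0.60 are described as moderate agreement, which is consistent with the reported κ = 0.565.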

Abstract

Chan et al conducted a rigorous comparison of DeepSeek-R1 and GPT-4 in complex medical diagnosis [1], showing that the two models have comparable core accuracy but that their inherent limitations underscore the irreplaceable role of human experts. The study used 100 clinicopathological cases from The New England Journal of Medicine and found that, although the two large language models (LLMs) had similar diagnostic matching rates (35% vs 39%) and differed in quality scores, their performance remained far from human-level reliability. This correspondence adheres to the TITAN guidelines [2]. Notably, the lower correct diagnostic inclusion rate of DeepSeek-R1 (48% vs 64% for GPT-4) and its longer differential lists (11.9 ± 2.0 vs 9.0 ± 1.4) point to a key defect: algorithmic rationality lacks the embodied judgment of human clinicians. LLMs generate diagnoses from statistical patterns in data, whereas human experts integrate contextual nuances that structured case files cannot fully capture, such as patient history, physical examination findings, and clinical intuition. Even among human raters, the moderate inter-rater consistency (κ = 0.565) underscores the complexity of diagnostic judgment, a skill honed through years of practice. A strength of the study is its controlled design, which eliminates “cheating” through pretraining or internet access and thus reflects the inherent ability of the models. However, this also exposes their rigidity: LLMs cannot adapt to unforeseen clinical scenarios or incorporate real-time patient feedback, both of which are core strengths of human experts. Although the open-source nature and reasoning strategy of DeepSeek-R1 hold promise for artificial intelligence (AI) assistive tools, its 35% diagnostic accuracy underscores that LLMs supplement human judgment rather than replace it.
Ethically and practically, medical diagnosis requires more than statistical accuracy: it requires accountability, empathy, and moral reasoning [3,4]. LLMs cannot resolve ethical dilemmas or deliver a diagnosis with empathy, capacities that are indispensable to human medical practice [5]. The limitations of the study, including selection bias and subjective quality scoring, further highlight the need for human oversight in verifying and contextualizing AI output. In conclusion, the study by Chan et al both confirms the potential of LLMs to enhance clinical decision-making and reinforces the irreplaceable role of human experts. The gap between algorithmic rationality and embodied clinical practice, which is rooted in experience, intuition, and ethical judgment, ensures that human clinicians remain at the core of complex medical diagnosis. Future AI development should focus on collaborative models that leverage the strengths of LLMs while preserving the human factors crucial to high-quality health care.
