What question did this study set out to answer?

Evaluate the performance of DeepSeek R1, GPT-4.1, and Claude 3.5 on complex clinical scenarios.

February 12, 2026

Measuring The Accuracy and Reproducibility of DeepSeek R1, Claude 3.5 Sonnet, and GPT‑4.1 on Complex Clinical Scenarios

Key Points

Evaluate the performance of DeepSeek R1, GPT-4.1, and Claude 3.5 on complex clinical scenarios.
Selected a dataset of complex medical cases.
Accessed models via application programming interfaces (APIs).
Used standardized prompts and a predefined evaluation protocol.
Achieved overall accuracy of 77.1% across models.
GPT-4.1 produced the fewest errors, while Claude 3.5 Sonnet produced the most.
Reproducibility was highest for DeepSeek R1 (100%), followed by GPT-4.1 (97.5%) and Claude 3.5 Sonnet (92%).

Abstract

Background: Integrating large language models (LLMs) into clinical diagnostics raises significant questions about their accuracy and reliability.

Objective: This study aimed to evaluate the performance of DeepSeek R1, an open-source reasoning model, alongside two other LLMs, GPT-4.1 and Claude 3.5 Sonnet, on multiple-choice clinical cases.

Methods: A dataset of complex medical cases representative of real-world clinical practice was selected. For efficiency, models were accessed via application programming interfaces (APIs) and assessed using standardized prompts and a predefined evaluation protocol.

Results: The models demonstrated an overall accuracy of 77.1%, with GPT-4.1 producing the fewest errors and Claude 3.5 Sonnet the most. The reproducibility analysis indicated that results were highly repeatable: DeepSeek R1 (100%), GPT-4.1 (97.5%), and Claude 3.5 Sonnet (92%).

Conclusions: While LLMs show promise for enhancing diagnostics, ongoing scrutiny is required to address error rates and validate standard medical answers. Given the limited dataset and prompting protocol, the findings should not be interpreted as evidence of broader equivalence in real-world clinical reasoning. This study demonstrates the need for robust evaluation standards, attention to error rates, and further research.
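The two headline metrics, accuracy and reproducibility, can be illustrated with a minimal sketch. The study summary does not give exact definitions, so the sketch assumes that accuracy is the share of multiple-choice cases answered in agreement with the reference key, and that reproducibility is the share of cases for which a model returns the same answer on every repeated run. The function names, case identifiers, and answers below are hypothetical and are not the study's data.

```python
def accuracy(answers, key):
    """Fraction of cases whose answer matches the reference key.

    Assumed definition; the study's exact scoring protocol is not stated here.
    """
    correct = sum(1 for case_id, answer in answers.items() if key[case_id] == answer)
    return correct / len(answers)


def reproducibility(runs):
    """Fraction of cases answered identically across all repeated runs.

    Assumed definition: a case counts as reproducible only if every run
    returns the same option, whether or not that option is correct.
    """
    case_ids = list(runs[0].keys())
    stable = sum(1 for case_id in case_ids if len({run[case_id] for run in runs}) == 1)
    return stable / len(case_ids)


# Hypothetical reference key and two repeated runs of one model.
key = {"case_1": "A", "case_2": "C", "case_3": "B", "case_4": "D"}
run_1 = {"case_1": "A", "case_2": "C", "case_3": "D", "case_4": "D"}
run_2 = {"case_1": "A", "case_2": "B", "case_3": "D", "case_4": "D"}

print(accuracy(run_1, key))             # 3 of 4 correct
print(reproducibility([run_1, run_2]))  # 3 of 4 cases answered identically
```

Under these assumed definitions the two metrics are independent: in the example, `case_3` is answered consistently but incorrectly, so it counts toward reproducibility and not toward accuracy. This is why a model can be fully reproducible without being the most accurate.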
