Abstract
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring that their reasoning is trustworthy is paramount. However, current strategies for evaluating LLMs’ medical reasoning suffer from either unsatisfactory assessment quality or poor scalability, and a rigorous benchmark remains absent. To address this, we present MedThink-Bench, a benchmark designed for rigorous and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 high-complexity questions spanning ten medical domains, each accompanied by expert-authored, step-by-step rationales that elucidate the intermediate reasoning process. Further, we introduce LLM-w-Rationale, an evaluation framework that combines fine-grained rationale assessment with an LLM-as-a-Judge paradigm, enabling expert-level fidelity in evaluating reasoning quality while preserving scalability. Results show that LLM-w-Rationale correlates strongly with expert evaluation (Pearson coefficient up to 0.87) while requiring only 1.4% of the evaluation time. Overall, MedThink-Bench establishes a rigorous and scalable standard for evaluating medical reasoning in LLMs, advancing their safe and responsible deployment in clinical practice.
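The agreement figure quoted in the abstract is a Pearson correlation between automated scores and expert scores. The following is a minimal sketch of how such a comparison could be computed. It is not the authors' implementation: the function names, the idea of scoring a response as the fraction of expert rationale steps it covers, and all the numbers are assumptions made for illustration only.

```python
from math import sqrt


def rationale_coverage(step_covered):
    """Fraction of expert rationale steps marked as covered.

    Hypothetical scoring rule: `step_covered` holds one boolean per
    expert-authored step, as a judge model might return them.
    """
    if not step_covered:
        raise ValueError("at least one rationale step is required")
    return sum(step_covered) / len(step_covered)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up judge verdicts for four questions (illustrative, not paper data).
judge_verdicts = [
    [True, True, False, True],
    [True, False, False, False],
    [True, True, True, True],
    [False, True, False, False],
]
judge_scores = [rationale_coverage(v) for v in judge_verdicts]

# Made-up expert scores for the same four questions.
expert_scores = [0.8, 0.3, 1.0, 0.2]

agreement = pearson(judge_scores, expert_scores)
print(f"Pearson r between judge and expert scores: {agreement:.2f}")
```

A value near 1 would indicate that the automated scores rank and scale responses much as the experts do. With real data, a library routine such as `scipy.stats.pearsonr` would also report a p-value.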
Shuang Zhou
Wenya Xie
Jiaxi Li
npj Digital Medicine
Columbia University
University of California, San Francisco
University of Minnesota
Zhou et al. studied this question.
www.synapsesocial.com/papers/694020fd2d562116f28fb4eb — DOI: https://doi.org/10.1038/s41746-025-02208-7