Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute it to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for LLMs' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities, fine-grained difficulty granularity, and necessitates multi-turn interactions with the environments. Moreover, MTR-Bench features fully-automated framework spanning both dataset constructions and model evaluations, which enables scalable assessment without human interventions. Extensive experiments reveal that even the cutting-edge reasoning models fall short of multi-turn, interactive reasoning tasks. And the further analysis upon these results brings valuable insights for future research in interactive AI systems.
Building similarity graph...
Analyzing shared references across papers
Loading...
Xiaoyuan Li
Keqin Bao
Yubo Ma
Building similarity graph...
Analyzing shared references across papers
Loading...
Li et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68da58d8c1728099cfd10f63 — DOI: https://doi.org/10.48550/arxiv.2505.17123
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: