What type of study is this?

This is a quantitative study.

September 22, 2025 · Open Access

Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics

Key Points

Code-assisted LLMs showed varying degrees of logical soundness in mathematical reasoning tasks, affecting their performance.
Evaluation metrics need to move beyond execution correctness, as many code generations rely on unsound reasoning.
Increased problem difficulty correlated with lower soundness in generated programs across all models assessed.
Closed-source models grounded their programs in genuine mathematical concepts, unlike many open-source counterparts.

Abstract

Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness and lacks a rigorous assessment of the programs they generate. In this work, we bridge this gap with an in-depth analysis of the programs that code-assisted LLMs generate in response to math reasoning tasks, focusing on the soundness of the underlying reasoning. To this end, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that a model's capabilities significantly affect the logic it implements to solve a problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing problem difficulty reduces the share of sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics that accuracy metrics do not capture. Our work highlights the need for more holistic evaluations of code-assisted LLMs, beyond execution accuracy, toward a better understanding of LLMs' limits in the math domain.
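To make the gap between execution correctness and soundness concrete, here is a minimal sketch. The problem (counting the positive divisors of 360), the function names, and the three informal labels are assumptions chosen for illustration; they are not taken from the paper and are not its taxonomy. All three programs return the right answer, so an execution-only check cannot tell them apart.

```python
# Illustrative sketch only: the problem, function names, and labels are
# hypothetical and do not come from the paper.

def divisors_grounded(n: int) -> int:
    """Count divisors via prime factorization: product of (exponent + 1)."""
    count = 1
    p = 2
    while p * p <= n:
        exponent = 0
        while n % p == 0:
            n //= p
            exponent += 1
        count *= exponent + 1
        p += 1
    if n > 1:  # one remaining prime factor with exponent 1
        count *= 2
    return count


def divisors_exhaustive(n: int) -> int:
    """Count divisors by testing every candidate from 1 to n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def divisors_memorized(n: int) -> int:
    """Return a hard-coded answer without reasoning about n."""
    return 24


def execution_correct(program, n: int, expected: int) -> bool:
    """Execution-only check: compare the program's output to the reference."""
    return program(n) == expected


programs = {
    "grounded": divisors_grounded,
    "exhaustive": divisors_exhaustive,
    "memorized": divisors_memorized,
}

# Every program passes on the benchmark instance (360 has 24 divisors).
results = {name: execution_correct(fn, 360, 24) for name, fn in programs.items()}
print(results)
```

Changing the input exposes the difference: for 12, the grounded and exhaustive versions both return 6, while the hard-coded version still returns 24. The exhaustive version stays correct but scales poorly and encodes no mathematical insight, which is the kind of distinction an execution-accuracy metric does not record.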

Cite This Study

Al-Khalili et al. studied this question.

synapsesocial.com/papers/68d46fdc31b076d99fa6a594 https://doi.org/10.48550/arxiv.2504.17665
