What question did this study set out to answer?

The study aims to assess the mathematical reasoning of large language models using a benchmark whose questions are made more complex while preserving their meaning.

April 5, 2026 | Open Access

GSM-Identity: Evaluating Mathematical Reasoning in LLMs via Equivalence Transformations

Key Points

Developed the GSM-Identity pipeline, which rewrites numerical values in the GSM8K dataset as mathematically equivalent expressions.
Evaluated LLMs ranging from 7B to 72B parameters using standard, notice-based, and chain-of-thought prompting.
Compared each model's performance on GSM8K and GSM-Identity, and against human evaluations.
Math-oriented models retain most of their GSM8K performance on GSM-Identity, while general-purpose models show a significant drop.
Models in the 7-billion-parameter range perform similarly to humans on the modified problems.
Models with more than 70 billion parameters are more accurate than humans and more resilient to the modifications.

Abstract

We introduce GSM-Identity, a pipeline that modifies existing mathematical reasoning benchmarks by adding complexity to the questions while preserving their fundamental meaning. By systematically transforming numerical values in the GSM8K dataset into mathematically equivalent but less obvious expressions, we create a benchmark that measures the mathematical understanding of Large Language Models (LLMs). We evaluate LLMs ranging from 7 billion to 72 billion parameters using multiple prompting strategies, including standard, notice-based, and chain-of-thought approaches. We find that math-oriented models retain most of their GSM8K performance when evaluated on GSM-Identity, while general-purpose models show significant performance degradation. A comparison with human evaluations reveals that models in the 7-billion-parameter range perform similarly to humans when exposed to the kind of modifications we study, while models with more than 70 billion parameters answer the questions more accurately than humans and are also more resilient to the modifications. Our findings highlight GSM-Identity as a valuable tool for distinguishing reasoning from memorization, offering insights into the ability of LLMs to understand higher-level mathematical concepts.
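To make the idea of an equivalence transformation concrete, the following is a minimal sketch of how integers in a word problem could be replaced by expressions with the same value. It is not the authors' pipeline: the templates, the function names `equivalent_expression` and `transform_question`, and the sample question are all assumptions chosen for illustration, and the abstract does not specify which expressions GSM-Identity actually uses.

```python
import random
import re

# Matches standalone integers, skipping decimals ("3.5") and
# comma-grouped numbers ("1,000"), which this sketch leaves untouched.
INTEGER = re.compile(r"(?<![\d.,])\d+(?![.,]?\d)")


def equivalent_expression(n: int, rng: random.Random) -> str:
    """Return an arithmetic expression whose value is n.

    The three templates are hypothetical examples of "equivalent but
    less obvious" forms; they are not taken from the paper.
    """
    k = rng.randint(2, 9)
    templates = [
        f"({n + k} - {k})",      # add then subtract the same constant
        f"({n * k} / {k})",      # multiply then divide by the same constant
        f"({n} + {k} - {k})",    # append a term that cancels out
    ]
    return rng.choice(templates)


def transform_question(question: str, seed: int = 0) -> str:
    """Rewrite every standalone integer in the question as an equivalent expression."""
    rng = random.Random(seed)  # seeded so the same question always maps to the same variant
    return INTEGER.sub(lambda m: equivalent_expression(int(m.group()), rng), question)


if __name__ == "__main__":
    # Illustrative question written for this sketch, not an item from GSM8K.
    original = "A shop sells 12 apples for 3 dollars each. How much do they cost in total?"
    print(transform_question(original, seed=1))
```

Because each replacement evaluates to the original number, the correct answer is unchanged. A model that solves the original question but fails the rewritten one is therefore reacting to surface form, which is the distinction between reasoning and memorization that the abstract describes.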

Cite This Study

Negi et al. studied this question.

synapsesocial.com/papers/69d1fc8ea79560c99a0a2326 https://doi.org/10.1007/s10994-026-07029-7
