What question did this study set out to answer?

This research evaluates the effectiveness of three large language models in solving mathematical problems of varying complexity.

April 12, 2026 · Open Access

Evaluation of LLMs for mathematical problem solving

Ruonan Wang (Ministry of Education), Runxi Wang (UNSW Sydney), Yunwen Shen (UNSW Sydney)

Key Points

Compared three LLMs (GPT-4o, DeepSeek-V3, Gemini-2.0) on three mathematics datasets (GSM8K, MATH500, MIT OpenCourseWare)
Employed the Structured Chain-of-Thought (SCoT) framework for assessment
Assessed final answer correctness, intermediate calculation accuracy, and problem comprehension
Gemini-2.0 showed the best overall performance, especially on advanced MIT problems
DeepSeek-V3 excelled in optimization tasks but had accuracy fluctuations in statistical inference
GPT-4o displayed stability but lacked detail in some solutions
All LLMs faced challenges with multi-step reasoning and symbolic logic

Abstract

• In this study, we compare three prominent LLMs on three mathematics datasets of varying complexity.
• We use the SCoT framework to assess final answer correctness, intermediate calculation accuracy, and problem comprehension.
• GPT-4o is the most stable and consistent in performance across all the datasets.
• DeepSeek-V3 is competitively strong in optimisation, but suffers from fluctuations in accuracy in statistical inference tasks.
• Gemini-2.0 shows strong linguistic understanding and clarity, but performs poorly in multi-step reasoning and symbolic logic.
• Our error analysis reveals particular deficits in each LLM.

Large Language Models (LLMs) have shown strong performance on a range of educational tasks, but their potential to solve difficult mathematical problems requires further evaluation. In this study, we compare three prominent LLMs, GPT-4o, DeepSeek-V3, and Gemini-2.0, on three mathematics datasets of varying complexity: GSM8K, MATH500, and the MIT OpenCourseWare dataset. We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results indicate that Gemini-2.0 achieved the strongest overall performance across the three datasets and performed particularly well on advanced questions from the MIT OpenCourseWare dataset. DeepSeek-V3 was competitively strong in well-structured domains such as optimisation, but showed less consistent performance in statistical inference tasks. Overall, the findings suggest that current LLMs can perform competitively on structured mathematical tasks, but face clear limitations on problems requiring sustained multi-step reasoning and advanced symbolic manipulation.
Supplementary qualitative analysis also revealed model-specific weaknesses: GPT-4o lacked sufficient precision or explanatory detail in some cases, DeepSeek-V3 often provided condensed solutions with limited intermediate detail, and Gemini-2.0 showed reduced flexibility on more complex mathematical problems.
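
The five assessment dimensions named in the abstract can be pictured as a per-solution score record that is averaged per dimension. The sketch below is illustrative only: the 0-to-1 scale, the `StepwiseScore` and `dimension_means` names, and the placeholder grades are all assumptions made for this example, and none of it is taken from the paper's actual rubric or results.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class StepwiseScore:
    """One graded solution. The [0, 1] scale is an assumption of this sketch."""
    final_answer_correctness: float
    step_completeness: float
    step_validity: float
    intermediate_calculation_accuracy: float
    problem_comprehension: float


def dimension_means(scores):
    """Average each of the five dimensions over a list of graded solutions."""
    if not scores:
        raise ValueError("need at least one graded solution")
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(StepwiseScore)
    }


# Placeholder grades for two hypothetical solutions (not data from the study).
graded = [
    StepwiseScore(1.0, 1.0, 1.0, 1.0, 1.0),
    StepwiseScore(0.0, 0.5, 0.5, 0.0, 1.0),
]
summary = dimension_means(graded)
```

Keeping the dimensions separate, rather than collapsing them into one number, makes it possible to see cases such as a solution that understands the problem but fails on intermediate calculations.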
