What does this research mean for the field?

Reasoning-enabled large language models significantly outperform non-reasoning models in university-level mathematics. DeepSeek-V3.2 and Kimi-K2.5 achieve average scores of 88.24 and 88.99, respectively, and DeepSeek-V3.2 offers the most reasonable accuracy-cost trade-off under the study's experimental conditions. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to evaluate the performance of large language models in solving university-level mathematics problems.

May 31, 2026 | Open Access

Exploring Large Language Models for University-Level Mathematics: A Comparative Study

Fan Yang (Jiangsu University), Wanwan Xia (Nanjing Tech University), Wenting Qin (Nanjing Tech University)

Key Points

This research aims to evaluate the performance of large language models in solving university-level mathematics problems.
Systematic evaluation of eleven state-of-the-art large language models on five undergraduate mathematics courses.
Development of an automated pipeline for solution inference and answer evaluation.
Use of seven reasoning-enabled models for ensemble evaluation.
Reasoning models significantly outperform non-reasoning models.
DeepSeek-V3.2 and Kimi-K2.5 achieved average scores of 88.24 and 88.99, respectively.
DeepSeek-V3.2 offers the best accuracy-cost trade-off based on the experimental conditions.

Abstract

Large language models (LLMs) have shown great potential in solving complex mathematical problems, but their performance in university-level mathematics is still underexplored. This study provides a systematic evaluation of eleven state-of-the-art LLMs on five core undergraduate mathematics courses. An end-to-end automated pipeline is proposed for solution inference and answer evaluation, including a reliable ensemble evaluation scheme using seven reasoning-enabled LLMs as expert evaluators. The experimental results show that reasoning models significantly outperform non-reasoning ones, with DeepSeek-V3.2 and Kimi-K2.5 achieving average scores of 88.24 and 88.99, respectively. Under the experimental conditions of this study, DeepSeek-V3.2 achieves the most reasonable accuracy-cost trade-off. This work reveals the strengths and limitations of modern LLMs in advanced mathematical reasoning and provides insights into their application in AI-assisted mathematics education.
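The abstract does not say how the seven evaluator verdicts are combined, so the following is only a minimal sketch of one plausible aggregation rule, simple majority voting, and not the authors' method. The function names `ensemble_verdict` and `average_score` are hypothetical, and each verdict is assumed to be a boolean (True meaning the evaluator judged the answer correct).

```python
from collections import Counter


def ensemble_verdict(verdicts):
    """Return the majority verdict among evaluator votes.

    ASSUMPTION: majority voting is an illustrative choice; the paper's
    actual aggregation rule is not stated in the abstract.
    verdicts: list of booleans, one per evaluator.
    With an odd number of evaluators (such as seven) ties cannot occur;
    with an even number, a tie is treated as 'incorrect'.
    """
    if not verdicts:
        raise ValueError("at least one evaluator verdict is required")
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def average_score(per_problem_verdicts):
    """Percentage of problems whose ensemble verdict is 'correct'."""
    if not per_problem_verdicts:
        raise ValueError("at least one problem is required")
    results = [ensemble_verdict(v) for v in per_problem_verdicts]
    return 100.0 * sum(results) / len(results)
```

Using an odd number of evaluators keeps this rule free of ties, which may be one reason an ensemble of seven is convenient, though the source does not state this.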

Cite This Study

Yang et al. (2026) studied this question.

synapsesocial.com/papers/6a1bd21d5783ba022b6fd76d https://doi.org/10.3390/math14111886
