Large language models (LLMs) have shown great potential for solving complex mathematical problems, but their performance in university-level mathematics remains underexplored. This study provides a systematic evaluation of eleven state-of-the-art LLMs on five core undergraduate mathematics courses. We propose an end-to-end automated pipeline for solution inference and answer evaluation, including a reliable ensemble evaluation scheme that uses seven reasoning-enabled LLMs as expert evaluators. The experimental results show that reasoning models significantly outperform non-reasoning ones, with DeepSeek-V3.2 and Kimi-K2.5 achieving average scores of 88.24 and 88.99, respectively. Under the experimental conditions of this study, DeepSeek-V3.2 achieves the most reasonable accuracy–cost trade-off. This work reveals the strengths and limitations of modern LLMs in advanced mathematical reasoning and provides insights into their application in AI-assisted mathematics education.
Yang et al. (Fri,) studied this question.