• In this study, we compare three prominent LLMs on three mathematics datasets of varying complexity.
• We use the SCoT framework to assess final answer correctness, intermediate calculation accuracy, and problem comprehension.
• GPT-4o is the most stable and consistent in performance across all the datasets.
• DeepSeek-V3 is competitively strong in optimisation, but suffers from fluctuations in accuracy in statistical inference tasks.
• Gemini-2.0 shows strong linguistic understanding and clarity, but performs poorly in multi-step reasoning and symbolic logic.
• Our error analysis reveals particular deficits in each LLM.

Large Language Models (LLMs) have shown strong performance on a range of educational tasks, but their potential to solve difficult mathematical problems requires further evaluation. In this study, we compare three prominent LLMs, GPT-4o, DeepSeek-V3, and Gemini-2.0, on three mathematics datasets of varying complexity: GSM8K, MATH500, and the MIT OpenCourseWare dataset. We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results indicate that GPT-4o achieved the strongest and most consistent overall performance across the three datasets and performed particularly well on advanced questions from the MIT OpenCourseWare dataset. DeepSeek-V3 was competitively strong in well-structured domains such as optimisation, but showed less consistent performance in statistical inference tasks. Overall, the findings suggest that current LLMs can perform competitively on structured mathematical tasks, but face clear limitations on problems requiring sustained multi-step reasoning and advanced symbolic manipulation.
Supplementary qualitative analysis also revealed model-specific weaknesses: GPT-4o lacked sufficient precision or explanatory detail in some cases, DeepSeek-V3 often provided condensed solutions with limited intermediate detail, and Gemini-2.0 showed reduced flexibility on more complex mathematical problems.
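As a rough illustration of how a five-dimensional assessment of this kind might be tabulated, the following minimal sketch stores one graded response per record and averages each dimension per model. The record layout, the [0, 1] scoring scale, the function name `mean_by_model`, and the placeholder rows are all assumptions made for illustration; they are not taken from the study and do not reproduce its results.

```python
from collections import defaultdict
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """One graded response; each dimension is scored in [0, 1] (assumed scale)."""
    model: str
    dataset: str
    final_answer_correctness: float
    step_completeness: float
    step_validity: float
    intermediate_calculation_accuracy: float
    problem_comprehension: float


# The five assessment dimensions named in the abstract.
DIMENSIONS = [f.name for f in fields(RubricScore) if f.name not in ("model", "dataset")]


def mean_by_model(scores):
    """Average every dimension over all graded responses of each model."""
    grouped = defaultdict(list)
    for score in scores:
        grouped[score.model].append(score)
    return {
        model: {dim: mean(getattr(row, dim) for row in rows) for dim in DIMENSIONS}
        for model, rows in grouped.items()
    }


# Placeholder rows for illustration only; NOT results from the study.
example = [
    RubricScore("model_a", "GSM8K", 1.0, 1.0, 1.0, 1.0, 1.0),
    RubricScore("model_a", "MATH500", 0.0, 0.5, 0.5, 0.0, 1.0),
]
summary = mean_by_model(example)
```

Keeping the dimensions as separate fields, rather than collapsing them into a single score, makes it possible to see whether a model loses points on the final answer, on the intermediate steps, or on comprehension.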
Wang et al. studied this question.