September 10, 2025

Logical Reasoning Capabilities of Large Language Models: A Comparative Evaluation on GPQA Dataset

Key Points

All tested models outperformed human experts in overall accuracy, showing the potential of large language models in reasoning tasks.
Models using Chain-of-Thought mechanisms demonstrated higher accuracy, signifying the importance of advanced reasoning strategies.
Physics tasks showed strong performance, while chemistry questions with complex organic inference revealed weaknesses in model capabilities.
Increased processing time did not improve accuracy, indicating inefficiencies in reasoning pathways of the models.

Abstract

Recent advancements in Large Language Models (LLMs) have significantly enhanced their logical reasoning capabilities, presenting new opportunities for applications in scientific reasoning tasks. This study systematically evaluates and compares the logical reasoning performance of five prominent LLMs (GPT-4o, OpenAI-o3, OpenAI-o3 pro, DeepSeek V3, and DeepSeek R1) using the GPQA dataset, a standardized collection of science-related multiple-choice questions spanning biology, chemistry, and physics. Three dimensions were analyzed: overall accuracy, response time, and performance across difficulty levels. Results indicate that all tested models outperformed human experts in overall accuracy. In particular, models utilizing deep-thinking (Chain-of-Thought) mechanisms consistently surpassed those without, underscoring the effectiveness of advanced reasoning strategies. Domain-specific analyses revealed superior performance on structured computational tasks in physics but relatively weaker performance on chemistry questions involving complex organic inference. Notably, increased processing time (as observed with OpenAI-o3 pro) did not proportionally enhance accuracy. Detailed analysis suggests this discrepancy was due not to resource constraints but to inefficiencies in the model's reasoning or exploratory pathways, as it frequently expended additional time retrieving descriptive but non-essential background information. Further investigation into these reasoning bottlenecks is needed to understand and overcome these limitations and to inform future research and model improvements.
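
As a rough illustration of the three dimensions named in the abstract (overall accuracy, response time, and performance across difficulty levels), the sketch below tallies them from per-question records. The record layout, the model names `model_a` and `model_b`, and every value are hypothetical placeholders, not data or code from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records, NOT results from the study:
# (model, domain, difficulty, is_correct, response_seconds)
records = [
    ("model_a", "physics", "easy", True, 4.0),
    ("model_a", "chemistry", "hard", False, 9.0),
    ("model_b", "physics", "easy", True, 20.0),
    ("model_b", "chemistry", "hard", True, 40.0),
]

def summarize(rows, key):
    """Group rows by key(row) and report accuracy and mean response time."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {
        group: {
            "accuracy": mean(1.0 if r[3] else 0.0 for r in members),
            "mean_seconds": mean(r[4] for r in members),
            "n": len(members),
        }
        for group, members in groups.items()
    }

# Overall accuracy and response time per model.
overall = summarize(records, key=lambda r: r[0])
# Accuracy per model and difficulty level.
by_difficulty = summarize(records, key=lambda r: (r[0], r[2]))
# Accuracy per model and domain (biology, chemistry, physics).
by_domain = summarize(records, key=lambda r: (r[0], r[1]))

for model, stats in overall.items():
    print(model, stats)
```

Comparing `accuracy` against `mean_seconds` per model is one simple way to check whether longer processing time goes together with higher accuracy.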

Cite This Study

Bo Wan studied this question.

synapsesocial.com/papers/68c1a12d54b1d3bfb60dc512 https://doi.org/10.54254/2755-2721/2025.bj25665
