Recent advancements in Large Language Models (LLMs) have significantly enhanced their logical reasoning capabilities, presenting new opportunities for applications in scientific reasoning tasks. This study systematically evaluates and compares the logical reasoning performance of five prominent LLMs (GPT-4o, OpenAI-o3, OpenAI-o3 pro, DeepSeek V3, and DeepSeek R1) using the GPQA dataset, a standardized collection of science-related multiple-choice questions spanning biology, chemistry, and physics. Three dimensions were analyzed: overall accuracy, response time, and performance across difficulty levels. Results indicate that all tested models outperformed human experts in overall accuracy. In particular, models using deep-thinking (Chain-of-Thought) mechanisms consistently surpassed those without, underscoring the effectiveness of advanced reasoning strategies. Domain-specific analyses revealed superior performance on structured computational tasks in physics but relatively weaker performance on chemistry questions involving complex organic inference. Notably, increased processing time (as observed with OpenAI-o3 pro) did not proportionally enhance accuracy. Detailed analysis suggests this discrepancy was due not to resource constraints but to inefficiencies in the model's reasoning or exploratory pathways, as the model frequently expended additional time retrieving descriptive but non-essential background information. Further investigation into these reasoning bottlenecks is needed to understand and overcome these limitations and to inform future research and model improvements.
Bo Wan (Wed,) studied this question.