September 28, 2025 · Open Access

Highlights of the Issue - Large Language Models III

Key Points

Large reasoning models (LRMs) exhibit performance collapse as problem complexity increases, indicating limits in their generalizable problem-solving capabilities.
Empirical findings show that standard large language models (LLMs) outperform reasoning models on simpler tasks, while reasoning models excel when given tool augmentations.
Reasoning capabilities were evaluated in a controlled setup that compared tool-augmented LRMs against standard LLMs with the same tool access.
The results emphasize the need for structured reasoning and efficient tool use to push reasoning performance past the limits of a fixed output window.

Abstract

We continue our LLM series (LLM I, LLM II), emphasizing safety and value alignment. After Apple’s provocative article, The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, was published in June, several articles appeared critiquing or rebutting the premise that LLM “thinking” is an illusion. (Of course, this debate depends on semantics: how one defines “thinking.”) Notably, the Apple team used controllable puzzle environments rather than standard benchmarks to vary problem “compositional complexity” and tease out the quality of large reasoning model (LRM) reasoning, and found a sharp decline in quality beyond a certain level of compositional complexity in several tests. The “complexity” that the Apple team varied was the number of components of a puzzle (e.g., the number of disks in Tower of Hanoi).

The most interesting follow-up to the Apple work, to us, is Thinking Isn’t an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations. Song et al. start from computational complexity: they trace LLM and Chain of Thought (CoT) limitations to the circuit class TC^k that these models correspond to. (We publish a report on the class TC^k in this issue.) That noted, Song et al. augmented their LRM with additional tools and, to a large extent, overcame the problem-complexity-related degradation in reasoning found by the Apple team. Here is their insight:

…the underperformance of LRMs on hard tasks may not reflect a fundamental reasoning deficiency, but rather an artifact of the limited output window. A natural solution is to augment both models with external tools, such as Python interpreters or scratchpads, to overcome this limitation and better reflect the models’ actual reasoning abilities (p. 2).

They created a working-memory buffer to cope with this complexity.
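
To make the output-window argument concrete, here is a minimal Python sketch (ours, not code from either paper). It assumes the standard three-peg Tower of Hanoi, in which the optimal solution for n disks takes 2^n - 1 moves; the names hanoi_moves and is_valid_solution are hypothetical. The first function is the kind of short program a tool-augmented model could hand to a Python interpreter instead of writing out every move, and the second replays a move list on a simulated board, in the spirit of the simulator-based checks the Apple team describes.

```python
from typing import Iterator, List, Tuple

# A move is (source peg, target peg); pegs are numbered 0, 1, 2.
# These names are illustrative assumptions, not taken from either paper.
Move = Tuple[int, int]


def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> Iterator[Move]:
    """Yield the optimal move sequence for n disks using the classic recursion."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, onto the auxiliary peg.
    yield from hanoi_moves(n - 1, src, dst, aux)
    # Move the largest remaining disk to its destination.
    yield (src, dst)
    # Move the n-1 disks from the auxiliary peg onto the destination.
    yield from hanoi_moves(n - 1, aux, src, dst)


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Replay moves on a simulated board and reject any illegal move."""
    # Peg 0 starts with disks n..1, largest at the bottom of the list.
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False
        if not pegs[src]:
            return False  # nothing to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ended up on peg 2 in the right order.
    return pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 10, 20):
        # The move count grows exponentially with the number of disks.
        print(n, 2**n - 1)
```

Ten disks already require 1,023 moves and twenty require 1,048,575, so an answer written out move by move outgrows any fixed output budget long before the few-line program does.
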
Song et al. also created “structured reasoning” algorithms:

Structured reasoning refers to the process of breaking down complex problems into smaller, manageable steps and solving them systematically. This approach ensures that reasoning is logical, organized, and aligned with the problem's requirements. In the context of Large Reasoning Models (LRMs), structured reasoning is achieved through specific techniques and tools that guide the model to follow a step-by-step process (Adobe AI Assistant).

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar
Apple

From the article (p. 5):

Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically, by adjusting puzzle elements while preserving the core logic, and inspect both solutions and internal reasoning (Fig. 1, top). These puzzles: (1) offer fine-grained control over complexity; (2) avoid contamination common in established benchmarks; (3) require only the explicitly provided rules, emphasizing algorithmic reasoning; and (4) support rigorous, simulator-based evaluation, enabling precise solution checks and detailed failure analyses.

Our empirical investigation reveals several key findings about current Language Reasoning Models (LRMs): First, despite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold. Second, our comparison between LRMs and standard LLMs under equivalent inference compute reveals three distinct reasoning regimes (Fig. 1, bottom). For simpler, low-compositional problems, standard LLMs demonstrate greater efficiency and accuracy.
As problem complexity moderately increases, thinking models gain an advantage. However, when problems reach high complexity with longer compositional depth, both model types experience complete performance collapse (Fig. 1, bottom left).

Thinking Isn’t an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations
Zhao Song, Song Yue, Jiahao Zhang

From the article (p. 3):

Despite the progress in Large Reasoning Models (LRMs), recent work has questioned whether LRMs genuinely improve reasoning performance over standard LLMs. Theoretical analyses based on circuit complexity suggest that a Transformer using k CoT steps corresponds to the TC^k circuit class, indicating that even multi-step CoT reasoning may be limited in the complexity of problems it can solve [GRS+23, LLZM24, KS25]. Empirical evidence also shows that LRMs often generate lengthy outputs with many redundant or irrelevant tokens, increasing inference cost without improving task accuracy [CXL+24, QYS+25, SCW+25]. Furthermore, studies on math reasoning tasks indicate that reinforcement learning may not consistently enhance LRM performance [MAS+25]. A particularly notable benchmark is Apple’s “thinking-illusion” framework [SMA+25], which evaluates both LLMs and LRMs without any tool augmentations under controlled settings with varying task complexities. Their results show that LRMs outperform LLMs only on tasks of medium difficulty, while providing no clear advantage on either simple or very challenging problems.

In this paper, we revisit the evaluation of reasoning capabilities in LLMs and LRMs using a carefully controlled experimental setup. In contrast to previous work [SMA+25], we augment both model types with external tools, specifically a Python interpreter and a scratchpad, and find that LRMs with tool augmentation consistently outperform LLMs with the same tool access.
These results challenge prior empirical claims and offer new insights into the potential of LRMs under practical usage scenarios.

LLM Tool Use. Due to inherent limitations in Large Language Models (LLMs), such as restricted output length and hallucinations [JYX+23, CQT+24], a growing body of research has explored the use of external tools to enhance their problem-solving capabilities.

From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
OpenAI

From the blog post:

GPT-5 advances the frontier on safety. In the past, ChatGPT relied primarily on refusal-based safety training: based on the user’s prompt, the model should either comply or refuse. While this type of training works well for explicitly malicious prompts, it can struggle to handle situations where the user’s intent is unclear, or information could be used in benign or malicious ways. Refusal training is especially inflexible for dual-use domains such as virology, where a benign request can be safely completed at a high level, but might enable a bad actor if completed in detail.

For GPT-5, we introduced a new form of safety training, safe completions, which teaches the model to give the most helpful answer where possible while still staying within safety boundaries. Sometimes, that may mean partially answering a user’s question or only answering at a high level. If the model needs to refuse, GPT-5 is trained to transparently tell you why it is refusing, as well as provide safe alternatives. In both controlled experiments and our production models, we find that this approach is more nuanced, enabling better navigation of dual-use questions, stronger robustness to ambiguous intent, and fewer unnecessary overrefusals.

OpenAI’s blog post describes the new approach to safety in full, including metrics and results.
