What question did this study set out to answer?

This paper reviews recent advancements and challenges in large language models, focusing on their reasoning capabilities and multilingual features.

September 23, 2025 | Open Access

Reasoning beyond limits: Advances and open problems for LLMs

Key Points

Comprehensive review of 27 LLMs released between 2023 and 2025
Analysis of core innovations including training strategies and architectures
Discussion of challenges related to multi-step reasoning and task execution
Significant improvements in LLM reasoning through inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation
Enhancements in cross-lingual reasoning for multilingual models
Identification of key challenges for future LLM developments

Abstract

Recent breakthroughs in generative reasoning have fundamentally reshaped how large language models (LLMs) address complex tasks, enabling them to dynamically retrieve, refine, and organize information into coherent, multi-step reasoning chains. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been effectively applied to state-of-the-art models, including DeepSeek-R1, OpenAI’s o1 and o3, GPT-4o, Qwen-32B, and various Llama variants, significantly enhancing their reasoning capabilities. In this paper, we present a comprehensive review of the top 27 LLMs released between 2023 and 2025, such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and Phi-4, and analyze their core innovations and performance improvements. We also provide a detailed overview of recent advancements in multilingual large language models (MLLMs), emphasizing methods that improve cross-lingual reasoning and address the limitations of English-centric training. In parallel, we present a comprehensive review of progress in State Space Model (SSM)-based architectures, including models like Mamba, which demonstrate improved efficiency for long-context processing compared to Transformer-based approaches. Our analysis covers training strategies such as general optimization techniques, mixture-of-experts (MoE) configurations, retrieval-augmented generation (RAG), chain-of-thought prompting, self-improvement methods, and test-time compute scaling and distillation frameworks. Finally, we identify key challenges for future research, including enabling multi-step reasoning without human supervision, improving robustness in chained task execution, balancing structured prompting with generative flexibility, and enhancing the integration of long-context retrieval and external tools.
