What does this research mean for the field?

Directed stochastic skill search (DS3) provides a single framework for analysing inference in large language models, yielding closed-form expressions for task success and compute cost across a range of inference strategies.

What question did this study set out to answer?

The aim is to explore efficient inference strategies for large language models while considering their computational costs.

March 2, 2026 | Open Access

A theory of inference compute scaling: reasoning through directed stochastic skill search

Key Points

Introduced directed stochastic skill search (DS3) as a framework for inference.
Derived closed-form expressions for task success and compute cost.
Examined various inference strategies including chain-of-thought and tree-of-thought.
Connected insights from LLM training to inference patterns.
Theoretically recovered the observed linear scaling of accuracy with log-compute.
Found that preferred inference strategies vary by task difficulty and model capability.
Recovered emergent behaviour elicited by reasoning despite parameter plateaus.
Integrated best-of-N and majority voting within a single predictive framework.
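
To make the last point concrete, here is a minimal sketch of how best-of-N and majority voting can be compared on success probability. It is not the paper's DS3 derivation. It assumes independent samples that are each correct with probability `p`, a perfect verifier for best-of-N, and a binary answer space with odd `n` for majority voting. The function names are hypothetical.

```python
from math import comb


def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumption: a perfect verifier always picks a correct sample if one exists.
    """
    return 1.0 - (1.0 - p) ** n


def majority_vote_success(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumptions: only two possible answers (correct / incorrect) and odd n,
    so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    # Sum the binomial tail over all outcomes with more than n/2 correct samples.
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Compare the two strategies on a hard task (p < 0.5) and an easier one.
    for p in (0.3, 0.7):
        for n in (1, 3, 9, 27):
            print(p, n, best_of_n_success(p, n), majority_vote_success(p, n))
```

Under these toy assumptions, best-of-N improves with `n` for any `p > 0`, while majority voting improves only when `p > 0.5` and degrades otherwise. This is one simple way a preferred strategy can depend on task difficulty and model capability.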

Abstract

Large language models (LLMs) require considerable computation and energy resources during training and deployment. While scaling laws for training have guided much recent progress, inference costs represent a significant and growing component of the overall resource burden, particularly for reasoning models. Existing compute-optimality characterizations that consider model size, dataset size and inference tokens in isolation or fixed combinations may overlook more efficient operating points. We introduce directed stochastic skill search (DS3), a general framework that represents inference as stochastic traversal over a learnt skill graph. From a simplified yet expressive instantiation, we derive closed-form expressions for task success and compute cost across a wide range of inference strategies, including chain-of-thought (CoT) and tree-of-thought (ToT), enabling comparisons by task difficulty and model capability. We extend a prior graph framework of LLM training to include inference and bridge DS3 with empirical scaling laws. We theoretically recover observed patterns, including linear accuracy scaling with log-compute, variation in preferred inference strategies by task and capability, emergent behaviour elicited by reasoning despite parameter plateaus and both best-of-N and majority voting (MV) captured within one framework. By characterizing training-inference interdependencies, our framework deepens theoretical understanding and supports principled algorithmic design and resource allocation. This article is part of the discussion meeting issue ‘Bits, neurons and qubits for sustainable AI’.
