Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex, unfamiliar scenarios. In contrast, machine intelligence remains bound to its training data and cannot dynamically refine solutions at inference time. Recent advances have explored machine reasoning, which trades inference-time compute for improved performance, but they focus on verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step solution generation. Many tasks lack sufficient labelled data and require alternative mechanisms for improving performance, such as inference-time compute. Here we present a paradigm for machine reasoning in vision that enables performance to improve with increasing thinking time (inference-time compute), even with limited labelled data. Our approach is inspired by dual-process theories of human cognition. It integrates a fast-thinking System I module, which generates and verifies solutions in familiar tasks, with a slow-thinking System II module, which iteratively refines predictions using self-play reinforcement learning, even when task-specific data is limited. The paradigm involves proposing, competing over, and refining solutions until convergence. We demonstrate that extended inference-time compute yields superior performance compared to large-scale supervised learning, foundation models, and human experts in vision tasks. These tasks include computer-vision benchmarks and cancer localisation across five organs, highlighting the potential of inference-time compute for data-scarce problems.
Saeed et al. (Tue,) studied this question.