Autonomous experiment loops are scaling rapidly: Karpathy's AutoResearch established the paradigm; SkyPilot parallelised it; Bilevel Autoresearch meta-optimised search mechanisms; Centaur hybridised LLM and classical search; Sibyl introduced a self-evolving harness architecture; and AutoResearchClaw added verifiable reporting, failure-to-information conversion, and human-in-the-loop intervention modes. We identify a recurring pattern: each system improves search, execution, or memory, but none resolves two persistent bottlenecks: research-evaluator legitimacy (whether the metric captures the research objective) and judgment preservation (whether mechanistic context, failed trials, and structural insight survive compression across agent interfaces). We propose a four-part architectural model: search prior, execution evaluator, research evaluator, and judgment-preservation channel. Scaling improves the first two more easily than the latter two. We explain this asymmetry using the Constraint Inheritance Lemma from the representational theory of grounding (Badkur & Dak, 2026b): universally quantified constraints are robust under composition, while existentially quantified constraints are fragile. We also propose a benchmark protocol comparing seven interface architectures on three outcome dimensions (metric progress, discovery quality, and judgment preservation), including an evaluator-audit condition that tests whether human review at high-leverage points improves research validity.

Companion to: "What Survives Recursive Training: Three Bridges, the Evaluator Regress, and the Path to AGI" (Dak & Badkur, 2026).
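The asymmetry between universally and existentially quantified constraints can be illustrated with a toy example. The sketch below is not the formal Constraint Inheritance Lemma, whose statement is not reproduced here. It relies only on the elementary fact that a property holding for every item also holds for every subset, whereas a property holding for at least one item can be lost when a lossy hand-off drops the witness. The names `Trial`, `compress`, `within_budget`, and `explains_mechanism` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    """Hypothetical record of one experiment in an agent's log."""
    name: str
    within_budget: bool       # universal-style constraint: must hold for every trial
    explains_mechanism: bool  # existential-style constraint: some trial must carry this


def compress(trials: List[Trial], keep: int) -> List[Trial]:
    """Assumed lossy hand-off between agents: keep only the first `keep` trials."""
    return trials[:keep]


def universal_ok(trials: List[Trial]) -> bool:
    """True if every surviving trial respects the budget."""
    return all(t.within_budget for t in trials)


def existential_ok(trials: List[Trial]) -> bool:
    """True if at least one surviving trial carries mechanistic insight."""
    return any(t.explains_mechanism for t in trials)


log = [
    Trial("baseline", within_budget=True, explains_mechanism=False),
    Trial("lr-sweep", within_budget=True, explains_mechanism=False),
    Trial("failed-ablation", within_budget=True, explains_mechanism=True),
]
summary = compress(log, keep=2)

# The universal constraint is inherited by any subset of the log.
print(universal_ok(log), universal_ok(summary))
# The existential constraint is lost once its only witness is dropped.
print(existential_ok(log), existential_ok(summary))
```

In this toy setting the failed ablation is the only trial carrying mechanistic context, so truncating the log preserves the budget guarantee but silently discards the insight. That is the kind of loss the judgment-preservation channel is meant to prevent.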