What question did this study set out to answer?

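The study aims to address two bottlenecks in autonomous research systems: evaluator legitimacy (whether the metric captures the research objective) and judgment preservation (whether mechanistic context, failed trials, and structural insight survive compression across agent interfaces).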

May 30, 2026 | Open Access

The Foundation Is the Bottleneck: Evaluator Legitimacy and Judgment Preservation in Autonomous Research

Key Points

Targeted two key bottlenecks in autonomous research systems: evaluator legitimacy and judgment preservation.
Proposed a four-part model comprising search prior, execution evaluator, research evaluator, and judgment-preservation channel.
Proposed a benchmark protocol comparing seven interface architectures across three outcome dimensions: metric progress, discovery quality, and judgment preservation.
Included an evaluator-audit condition that tests whether human review at high-leverage points improves research validity.
Identified that scaling improves the search prior and execution evaluator more easily than the research evaluator and judgment-preservation channel.
Explained this asymmetry with the Constraint Inheritance Lemma: universally quantified constraints are robust under composition, while existentially quantified constraints are fragile.
Argued that evaluator legitimacy and judgment preservation remain unresolved in existing systems, each of which improves search, execution, or memory.

Abstract

Autonomous experiment loops are scaling rapidly: Karpathy's AutoResearch established the paradigm; SkyPilot parallelised it; Bilevel Autoresearch meta-optimised search mechanisms; Centaur hybridised LLM and classical search; Sibyl introduced self-evolving harness architecture; and AutoResearchClaw added verifiable reporting, failure-to-information conversion, and human-in-the-loop intervention modes. We identify a recurring pattern: each system improves search, execution, or memory, but none resolves two persistent bottlenecks — research-evaluator legitimacy (whether the metric captures the research objective) and judgment preservation (whether mechanistic context, failed trials, and structural insight survive compression across agent interfaces). We propose a four-part architectural model: search prior, execution evaluator, research evaluator, and judgment-preservation channel. Scaling improves the first two more easily than the latter two. We explain this asymmetry using the Constraint Inheritance Lemma from the representational theory of grounding (Badkur & Dak, 2026b): universally quantified constraints are robust under composition, while existentially quantified constraints are fragile. We propose a benchmark protocol comparing seven interface architectures on three outcome dimensions — metric progress, discovery quality, and judgment preservation — including an evaluator-audit condition that tests whether human review at high-leverage points improves research validity. Companion to: "What Survives Recursive Training: Three Bridges, the Evaluator Regress, and the Path to AGI" (Dak & Badkur, 2026).
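The contrast between universally and existentially quantified constraints can be illustrated with a small sketch. This is not the Constraint Inheritance Lemma itself, whose formal statement is not given here. It only shows the general logical fact that a "for all" property survives any lossy selection of records, while a "there exists" property may not. The `Trial` record, the `compress` function, and the example data are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trial:
    """Hypothetical record of one experiment run (illustrative only)."""
    name: str
    metric: float
    within_budget: bool      # a property every run is expected to satisfy
    explains_failure: bool   # a property only some runs carry


def compress(trials: List[Trial], keep: int) -> List[Trial]:
    """Hypothetical lossy interface: pass on only the top-`keep` trials by metric."""
    return sorted(trials, key=lambda t: t.metric, reverse=True)[:keep]


def for_all(trials: List[Trial], pred: Callable[[Trial], bool]) -> bool:
    return all(pred(t) for t in trials)


def exists(trials: List[Trial], pred: Callable[[Trial], bool]) -> bool:
    return any(pred(t) for t in trials)


log = [
    Trial("baseline", 0.71, within_budget=True, explains_failure=False),
    Trial("wider-model", 0.78, within_budget=True, explains_failure=False),
    Trial("diverged-run", 0.12, within_budget=True, explains_failure=True),
]
summary = compress(log, keep=2)

# Universal constraint: holds on the full log, so it holds on any subset.
universal_before = for_all(log, lambda t: t.within_budget)
universal_after = for_all(summary, lambda t: t.within_budget)

# Existential constraint: depends on one witness, which the compression drops.
existential_before = exists(log, lambda t: t.explains_failure)
existential_after = exists(summary, lambda t: t.explains_failure)

print(universal_before, universal_after)      # True True
print(existential_before, existential_after)  # True False
```

In this toy setup the metric-ranked summary discards the one failed trial that carried the explanatory context, which is the kind of loss the abstract describes under judgment preservation.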

Cite This Study

Badkur et al. studied this question.

synapsesocial.com/papers/6a1a82b80307b78509434726 https://doi.org/10.5281/zenodo.20422120
