May 30, 2026 | Open Access

The Foundation Is the Bottleneck: Evaluator Legitimacy and Judgment Preservation in Autonomous Research

Key Points

Addresses two persistent bottlenecks in autonomous research systems: research-evaluator legitimacy and judgment preservation.
Proposes a four-part architectural model comprising a search prior, an execution evaluator, a research evaluator, and a judgment-preservation channel.
Proposes a benchmark protocol comparing seven interface architectures on three outcome dimensions: metric progress, discovery quality, and judgment preservation.
Includes an evaluator-audit condition that tests whether human review at high-leverage points improves research validity.
Observes that scaling improves the search prior and execution evaluator more easily than the research evaluator and judgment-preservation channel.
Explains this asymmetry with the Constraint Inheritance Lemma: universally quantified constraints are robust under composition, while existentially quantified constraints are fragile.
Argues that evaluator legitimacy and judgment preservation remain unresolved despite advances in search, execution, and memory.

Abstract

Autonomous experiment loops are scaling rapidly: Karpathy's AutoResearch established the paradigm; SkyPilot parallelised it; Bilevel Autoresearch meta-optimised search mechanisms; Centaur hybridised LLM and classical search; Sibyl introduced self-evolving harness architecture; and AutoResearchClaw added verifiable reporting, failure-to-information conversion, and human-in-the-loop intervention modes. We identify a recurring pattern: each system improves search, execution, or memory, but none resolves two persistent bottlenecks — research-evaluator legitimacy (whether the metric captures the research objective) and judgment preservation (whether mechanistic context, failed trials, and structural insight survive compression across agent interfaces). We propose a four-part architectural model: search prior, execution evaluator, research evaluator, and judgment-preservation channel. Scaling improves the first two more easily than the latter two. We explain this asymmetry using the Constraint Inheritance Lemma from the representational theory of grounding (Badkur & Dak, 2026b): universally quantified constraints are robust under composition, while existentially quantified constraints are fragile. We propose a benchmark protocol comparing seven interface architectures on three outcome dimensions — metric progress, discovery quality, and judgment preservation — including an evaluator-audit condition that tests whether human review at high-leverage points improves research validity. Companion to: "What Survives Recursive Training: Three Bridges, the Evaluator Regress, and the Path to AGI" (Dak & Badkur, 2026).
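The asymmetry between universally and existentially quantified constraints can be illustrated with a small sketch. This is not the paper's formal statement of the Constraint Inheritance Lemma; it assumes, purely for illustration, that an agent interface "compresses" a trial log by forwarding only a subset of records. The `compress` function, the record fields `ran_ok` and `explains_failure`, and the sample data are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def holds_universally(records: List[Record], pred: Callable[[Record], bool]) -> bool:
    """True if every record satisfies the predicate."""
    return all(pred(r) for r in records)


def holds_existentially(records: List[Record], pred: Callable[[Record], bool]) -> bool:
    """True if at least one record satisfies the predicate."""
    return any(pred(r) for r in records)


def compress(records: List[Record], keep: int) -> List[Record]:
    """Hypothetical lossy interface: forwards only the first `keep` records."""
    return records[:keep]


# Hypothetical trial log: every trial executed cleanly, but only one
# carries the mechanistic insight about why an approach failed.
trials: List[Record] = [
    {"id": 1, "ran_ok": True, "explains_failure": False},
    {"id": 2, "ran_ok": True, "explains_failure": False},
    {"id": 3, "ran_ok": True, "explains_failure": True},
]

summary = compress(trials, keep=2)

# Universal constraint ("all trials ran"): survives any subset.
universal_before = holds_universally(trials, lambda r: bool(r["ran_ok"]))
universal_after = holds_universally(summary, lambda r: bool(r["ran_ok"]))

# Existential constraint ("some trial explains the failure"): can be lost.
existential_before = holds_existentially(trials, lambda r: bool(r["explains_failure"]))
existential_after = holds_existentially(summary, lambda r: bool(r["explains_failure"]))

print(universal_before, universal_after)      # universal property before/after
print(existential_before, existential_after)  # existential property before/after
```

Under this simplified reading, a property that holds for every record is inherited by any subset, whereas a property that depends on one particular record disappears as soon as that record is dropped. That is one way to see why execution checks might scale more easily than the preservation of mechanistic context.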

Authors

Prashi Badkur

Indian Institute of Technology Bombay

Mohit Dak

Birla Institute of Technology and Science, Pilani

Institutions

Columbia University

London Business School

Indian Institute of Technology Bombay
