What question did this study set out to answer?

The study set out to build a benchmark that measures whether agents retain stability and identity under selection pressure when no optimization target is provided.

January 14, 2026 · Open Access

Coherence–Constrained Evaluation System: A Reproducible Benchmark for Measuring Stability, Drift, and Irreversibility Under Selection Pressure

Key Points

Developed the Coherence–Constrained Evaluation System (CCES) to evaluate agents in read-only mode.
Assessed agents under adversarial but non-competitive perturbations.
Measured coherence, record support, drift, and collapse regimes.
Applied CCES to large language models and reinforcement-learning agents.
Demonstrated that agents with similar performance can have vastly different survivability profiles.
Revealed latent brittleness and delayed collapse modes not visible in standard benchmarks.
Established structural viability as a distinct and necessary evaluation axis for advanced agentic systems.

Abstract

We introduce the Coherence–Constrained Evaluation System (CCES), a reproducible benchmark designed to measure the structural viability of agents under irreversible selection pressure. Existing evaluation frameworks emphasize performance metrics such as accuracy, reward, or loss, but fail to assess whether agents maintain coherence and identity when optimization signals are unavailable, misleading, or gameable. CCES addresses this gap by evaluating agents in read-only mode under adversarial but non-competitive perturbations, measuring coherence, record support, drift, and collapse regimes. Derived from Coherence–Selection Interface Theory (CSIT), CCES operationalizes selection without providing an optimization target, rendering its metrics resistant to direct reward hacking. We demonstrate CCES across two distinct agent classes—large language models and embodied reinforcement-learning agents in Safety Gym—showing that agents with comparable performance can exhibit sharply different survivability profiles. CCES reveals latent brittleness and delayed collapse modes invisible to standard benchmarks, establishing structural viability as a distinct and necessary evaluation axis for advanced agentic systems.
