What question did this study set out to answer?

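The study asks whether frontier AI safety infrastructure is exposed to a failure mode distinct from model collapse, benchmark contamination, and distribution drift: one in which evaluation categories keep their boundaries while what they separate becomes progressively opaque.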

April 19, 2026 · Open Access

A Category-Level Failure Mode Not Captured by Distribution Metrics

Key Points

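Identifies a failure mode in frontier AI safety infrastructure that is distinct from model collapse, benchmark contamination, and distribution drift.
Specifies three failure conditions under which a category loses transparency: self-reference, anchor drift, and proxy displacement.
Demonstrates these conditions on the 2008 collapse of the Gaussian copula in credit markets.
Shows the same conditions assembling in ML training and evaluation infrastructure along three reinforcing paths: recursive synthetic injection, shared evaluation lineage, and proxy signal amplification.
Finds that no current approach internal to the ML substrate satisfies all three externality conditions (provenance independence, external anchoring, and speed parity) simultaneously.
Finds that no current non-ML automation track (formal verification, static analysis of learned weights, symbolic AI) closes the gap that substrate sharing leaves on the speed dimension.
Argues that the dominant "safety as a technical problem" framing has structural difficulty recognizing this failure mode.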

Abstract

This working paper identifies a failure mode in frontier AI safety infrastructure that is distinct from model collapse, benchmark contamination, or distribution drift. It describes a functional shift of evaluation categories — boundaries persist while what they separate becomes progressively opaque. The argument proceeds in three movements. First, three failure conditions are specified — self-reference, anchor drift, and proxy displacement — under which any category loses transparency. These conditions are demonstrated on the 2008 collapse of the Gaussian copula in credit markets, drawing on the co-production analysis of MacKenzie & Spears (2014). Second, the same conditions are shown to be assembling in ML training and evaluation infrastructure along three reinforcing paths: recursive synthetic injection (reifying self-reference), shared evaluation lineage (reifying anchor drift), and proxy signal amplification (reifying proxy displacement). Each path is a direct instantiation of one failure condition, and together they produce all three jointly across the categories "training data," "evaluation benchmark," and "human feedback." Third, three externality conditions are derived as the structural inverses of the failure conditions: provenance independence, external anchoring sustained by incentive misalignment, and speed parity. Four current safety approaches — UK/US AI Safety Institutes, interpretability research, open-weight release, and cryptographic attestation (zkML) — are mapped against these conditions, each reading as a specific trade-off rather than an uncategorized failure. No current approach internal to the ML substrate satisfies all three conditions simultaneously, and no current non-ML automation track (formal verification, static analysis of learned weights, symbolic AI) closes the gap left by substrate sharing on the speed dimension. 
The closing sections examine why the dominant "safety as a technical problem" framing has structural difficulty recognizing this failure mode, with the source of the limited visibility traced to the signal-to-noise structure of self-referential evaluation. Three partial external channels — physical-feedback loops, hardware attestation, and cryptographically certified human authorship — are sketched as starting points for the substrate-external audit infrastructure the note argues is required.

Authors

Ghjuvan Ortulanu

NatureServe

Institutions

NatureServe
