What question did this study set out to answer?

The study asks what must be established before an AI system's governability can be assessed, and identifies steerability as that precondition.

April 18, 2026 | Open Access

Governable Only Where It Chooses to Listen: Inference-Time Geometric Analysis Reveals a Missing Precondition for AI Safety Evaluation

Key Points

Steerability, meaning whether a model accepts substitutive instructions, is proposed as a necessary precondition for assessing governability.
The study introduces the Governability Stress Test Battery (GSTB) for evaluating large language models.
The GSTB is applied to five large language models at two scaffold strength levels.
Geometric analysis is used to assess concealed internal conflict and operational risk.
Conventional benchmark rankings misrepresent risk: the model with perfect benchmark performance showed the highest rate of concealed internal conflict.
A lower-accuracy model showed reduced operational risk because its failures remained observable and therefore correctable.
Five mechanistically distinct governance regimes are identified; none occupies the ideal quadrant of high steerability and low manipulation risk.

Abstract

Output-based evaluation of large language model safety contains a structural blind spot: it cannot distinguish resistance from non-registration. We introduce steerability (whether a model accepts substitutive instructions) as a necessary precondition for governability assessment. Governability evaluation is undefined unless steerability is first established. Applying the Governability Stress Test Battery (GSTB) across five large language models and two scaffold strength levels, we show that conventional benchmark rankings systematically misrepresent risk. A model achieving perfect benchmark performance (14/14 steerable, 100% accuracy) exhibits the highest rate of concealed internal conflict under geometric analysis, while a lower-accuracy model shows reduced operational risk because its failures remain observable and therefore correctable. These results establish a separation between input admission (steerability) and trajectory propagation (geometry). We further derive the principle that observability requires resistance: a model that complies with all instructions produces no geometric signal distinguishing safe from unsafe trajectories. Across all experiments, we identify five mechanistically distinct governance regimes, none of which occupies the ideal quadrant of high steerability and low manipulation risk. This suggests that achieving robust governability requires training for discriminative resistance rather than unconditional compliance.
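
The precondition stated in the abstract (governability is undefined unless steerability is first established) can be illustrated with a small sketch. This is not the GSTB or the paper's method: the `ProbeResult` record, the `min_steerability` threshold, and the use of a safe-output rate as a stand-in score are all hypothetical, chosen only to show how an evaluator might return "undefined" instead of a misleading number when a model never registers the instruction.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record of one substitutive-instruction probe."""
    accepted_substitution: bool  # did the model act on the replacement instruction?
    output_was_safe: bool        # was the resulting output judged safe?


def steerability_rate(results: Sequence[ProbeResult]) -> float:
    """Fraction of probes where the substitutive instruction was accepted."""
    return sum(r.accepted_substitution for r in results) / len(results)


def governability_score(
    results: Sequence[ProbeResult],
    min_steerability: float = 0.5,  # assumed threshold, not taken from the paper
) -> Optional[float]:
    """Return a stand-in score, or None when steerability is not established.

    None means "undefined": the instruction may never have registered, so
    output-based results cannot be read as resistance or as compliance.
    """
    if not results or steerability_rate(results) < min_steerability:
        return None
    accepted = [r for r in results if r.accepted_substitution]
    # Stand-in metric: safe-output rate among probes the model actually registered.
    return sum(r.output_was_safe for r in accepted) / len(accepted)


steerable = [ProbeResult(True, True)] * 3 + [ProbeResult(True, False)]
non_registering = [ProbeResult(False, True)] * 4

print(governability_score(steerable))        # a defined score
print(governability_score(non_registering))  # None: evaluation is undefined
```

In this sketch the non-registering model has only safe outputs, yet it receives no score at all, which mirrors the abstract's point that output-based evaluation alone cannot separate resistance from non-registration.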

Cite This Study

Gregory Ruddell studied this question.

synapsesocial.com/papers/69e320af40886becb653fd37 https://doi.org/10.5281/zenodo.19616557
