We present a framework for auditing whether LLM agent outputs are grounded in their context window or are speculative extrapolations. The core insight: the boundary between what a model is given (context, retrieved documents, tool outputs) and what it must generate provides a natural ground-truth oracle for epistemic state. A question answerable from context is factual; one requiring extrapolation beyond it is speculative. Training a lightweight MLP on activations at this boundary yields a deployable pre-generation auditor that classifies grounding state before any output enters the context window. A frozen layer sweep with the boundary dataset across five model families (Qwen2.5-7B, GPT-J 6B, Mistral-7B, Llama 3.2 3B, Qwen3.5-9B) reveals a striking pattern: more capable models, particularly those trained with reinforcement learning at scale, encode epistemic state earlier in the network and with cleaner geometry, making them more auditable, not less. Qwen3.5-9B (RL-trained, competing with 13× larger models on math benchmarks) achieves 99.5% frozen linear probe accuracy at layers 14-17, comparable to post-fine-tuning results on conventionally trained models. This inverts the conventional assumption that capability and interpretability trade off. For conventionally trained models where the boundary signal is diffuse (Qwen2.5 family), we identify a thermodynamic approach: delta magnitudes T(x) = mean(||hₙ₊₁ - hₙ||) achieve 98.39% accuracy (AUC 0.9786) on Qwen2.5-1.5B and 98.13% (AUC 0.9948) on Qwen2.5-7B with zero additional parameters, a 5.86-7.48pp AUC improvement over topological entropy methods (AUC 0.92). For RL-trained models, the geometry is already clean enough that a 2M-parameter MLP suffices. A second experiment on SNLI contradiction detection (97.75% MLP accuracy) completes the dual-probe auditor.
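As a rough illustration of the delta-magnitude score T(x) described above, the following minimal sketch computes the mean L2 norm of consecutive hidden-state differences. It assumes that n indexes successive layers and that one hidden vector per layer has already been extracted at the boundary position; the abstract does not specify either detail, and the function name `delta_magnitude` is hypothetical rather than taken from the paper.

```python
import numpy as np


def delta_magnitude(hidden_states: np.ndarray) -> float:
    """Return mean ||h[n+1] - h[n]|| over consecutive rows.

    hidden_states: array of shape (num_states, hidden_dim).
    Assumption: each row is the hidden vector at one layer, taken at
    the context/generation boundary. The source does not confirm this.
    """
    h = np.asarray(hidden_states, dtype=np.float64)
    if h.ndim != 2 or h.shape[0] < 2:
        raise ValueError("expected shape (num_states >= 2, hidden_dim)")
    deltas = np.diff(h, axis=0)              # h[n+1] - h[n] for each n
    norms = np.linalg.norm(deltas, axis=1)   # L2 norm of each difference
    return float(norms.mean())


# Toy input: three 2-dimensional states, not real model activations.
toy_states = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]])
toy_score = delta_magnitude(toy_states)
```

In this toy case the two difference norms are 5 and 0, so the score is 2.5. Since the score itself has no learned parameters, a classifier built on it would only need a decision threshold chosen on held-out data; how the paper sets that threshold is not stated in the abstract.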
Scott Seto
Scott Seto (Fri,) studied this question.
synapsesocial.com/papers/69acc59c32b0ef16a4050025 — DOI: https://doi.org/10.5281/zenodo.18893369