What question did this study set out to answer?

Whether the outputs of deployed LLM agents are grounded in their context window or are speculative extrapolations beyond it.

March 8, 2026 · Open Access

Context Window Boundary Probing: A Framework for Auditing Epistemic Grounding in Deployed LLM Agents

Key Points

Developed a framework that classifies agent outputs as grounded in the context window or extrapolated beyond it.
Trained a lightweight MLP on activations at the context boundary, so grounding is classified before any output is generated.
Ran a frozen layer sweep across five model families (Qwen2.5-7B, GPT-J 6B, Mistral-7B, Llama 3.2 3B, Qwen3.5-9B), including RL-trained models.
More capable models, particularly RL-trained ones, encode epistemic state earlier and more cleanly in the network, which makes them easier to audit.
Qwen3.5-9B reached 99.5% frozen linear probe accuracy at layers 14-17.
In conventionally trained models (Qwen2.5 family) the boundary signal is diffuse compared to RL-trained counterparts; a zero-parameter delta-magnitude statistic still exceeds 98% accuracy there.
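
The frozen layer probe named in the key points can be illustrated with a minimal sketch. It assumes activations have already been extracted per layer as plain lists of floats, with label 1 for grounded and 0 for speculative. The function names (`train_linear_probe`, `sweep_layers`), the label convention, and the gradient-descent settings are hypothetical choices for illustration, not the study's implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(
    X: Sequence[Vector], y: Sequence[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe (w, b) with batch gradient descent.

    The model that produced X is never updated: only the probe is trained.
    """
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w: Vector, b: float, X: Sequence[Vector], y: Sequence[int]) -> float:
    """Fraction of examples whose predicted class (logit > 0 means 1) matches y."""
    hits = 0
    for xi, yi in zip(X, y):
        predicted = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
        hits += predicted == yi
    return hits / len(X)


def sweep_layers(
    train_acts: Dict[int, Sequence[Vector]],
    y_train: Sequence[int],
    held_out_acts: Dict[int, Sequence[Vector]],
    y_held_out: Sequence[int],
) -> Dict[int, float]:
    """Train one probe per layer and report held-out accuracy per layer."""
    results = {}
    for layer, X in train_acts.items():
        w, b = train_linear_probe(X, y_train)
        results[layer] = probe_accuracy(w, b, held_out_acts[layer], y_held_out)
    return results
```

Layers where held-out accuracy peaks are the ones where the grounded versus speculative distinction is most linearly separable. Real activations have thousands of dimensions, so a vectorized library would replace these pure-Python loops in practice.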

Abstract

We present a framework for auditing whether LLM agent outputs are grounded in their context window or are speculative extrapolations. The core insight: the boundary between what a model is given (context, retrieved documents, tool outputs) and what it must generate provides a natural ground-truth oracle for epistemic state. A question answerable from context is factual; one requiring extrapolation beyond it is speculative. Training a lightweight MLP on activations at this boundary yields a deployable pre-generation auditor that classifies grounding state before any output enters the context window. A frozen layer sweep with the boundary dataset across five model families (Qwen2.5-7B, GPT-J 6B, Mistral-7B, Llama 3.2 3B, Qwen3.5-9B) reveals a striking pattern: more capable models, particularly those trained with reinforcement learning at scale, encode epistemic state earlier and more geometrically cleanly in the network, making them more auditable, not less. Qwen3.5-9B (RL-trained, competing with 13× larger models on math benchmarks) achieves 99.5% frozen linear probe accuracy at layers 14-17, comparable to post-fine-tuning results on conventionally trained models. This inverts the conventional assumption that capability and interpretability trade off. For conventionally trained models where the boundary signal is diffuse (Qwen2.5 family), we identify a thermodynamic approach: delta magnitudes $T(x) = \operatorname{mean}(\lVert h_{n+1} - h_n \rVert)$ achieve 98.39% accuracy (AUC 0.9786) on Qwen2.5-1.5B and 98.13% (AUC 0.9948) on Qwen2.5-7B with zero additional parameters, representing a 5.86-7.48pp AUC improvement over topological entropy methods (0.92). For RL-trained models, the geometry is already clean enough that a 2M-parameter MLP suffices. A second experiment on SNLI contradiction detection (97.75% MLP accuracy) completes the dual-probe auditor.
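
The delta-magnitude statistic in the abstract, the mean norm of the difference between consecutive hidden states, can be sketched in a few lines. The abstract does not say whether the index runs over layers or token positions, so the sketch simply takes an ordered sequence of hidden-state vectors. The threshold, the direction of the decision rule, and the names `delta_magnitude` and `classify_grounding` are assumptions made for illustration only.

```python
import math
from typing import Sequence


def delta_magnitude(hidden_states: Sequence[Sequence[float]]) -> float:
    """Mean L2 norm of the difference between consecutive hidden states."""
    if len(hidden_states) < 2:
        raise ValueError("need at least two hidden states")
    norms = []
    for prev, nxt in zip(hidden_states, hidden_states[1:]):
        if len(prev) != len(nxt):
            raise ValueError("hidden states must share one dimension")
        norms.append(math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, nxt))))
    return sum(norms) / len(norms)


def classify_grounding(
    hidden_states: Sequence[Sequence[float]],
    threshold: float,
    speculative_above: bool = True,
) -> str:
    """Threshold the statistic; it has no trainable parameters.

    ASSUMPTION: which side of the threshold means "speculative" is not
    stated in the abstract, so the direction is left configurable.
    """
    above = delta_magnitude(hidden_states) > threshold
    return "speculative" if above == speculative_above else "grounded"


# Toy example: consecutive differences have norms 5.0 and 0.0.
states = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
print(delta_magnitude(states))  # 2.5
```

In practice the threshold would be chosen on a labeled validation set, for example at the operating point that maximizes accuracy on the ROC curve.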

Authors

Scott Seto
