What question did this study set out to answer?

The study aims to track epistemic dynamics during text generation in large language models to understand how certainty and hallucination manifest.

March 16, 2026 · Open Access

Streaming Epistemic Geometry in Large Language Models: Token-Level Dynamics of Certainty, Hallucination, and Refusal Across Five Model Families

Key Points

Introduces streaming epistemic geometry: token-by-token tracking of epistemic subspace projections during autoregressive generation.
Applies PCA-based subspace analysis to five independently trained model families (Llama-3.1-8B, Mistral-7B, Gemma-2-9B, Qwen2.5-7B, Llama-3.2-3B).
Evaluates detection with a logistic classifier trained on the first-token projection score.
Hallucination, refusal, and certainty each produce a distinct dynamic signature in the residual stream, detectable from the first generated token.
The classifier achieves a leave-one-out AUC of 0.991 on Llama-3.1-8B and transfers zero-shot to TruthfulQA.
The subspace method flags factual-citation errors, while an output-entropy baseline flags physically improbable myths, so the two capture complementary failure modes.
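
For readers unfamiliar with the output-entropy baseline mentioned in the last point, the following is a minimal sketch of how the Shannon entropy of a next-token distribution can be computed from logits. The logit values and the helper names `softmax` and `token_entropy` are illustrative assumptions; the paper's exact entropy definition and any thresholding are not given in this summary.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy, in nats, of the next-token distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy logits over a 4-token vocabulary (made-up numbers, not model output).
peaked = [8.0, 0.0, 0.0, 0.0]  # one token dominates, so entropy is low
flat = [1.0, 1.0, 1.0, 1.0]    # uniform distribution, so entropy is log(4)

peaked_entropy = token_entropy(peaked)
flat_entropy = token_entropy(flat)
```

A high value indicates the model spreads probability over many candidate tokens at that step; a low value indicates a confident prediction, whether or not it is correct.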

Abstract

We introduce streaming epistemic geometry, the first token-by-token tracking of epistemic subspace projections during autoregressive generation in large language models. Using PCA-based subspace analysis on five independently trained model families (Llama-3.1-8B, Mistral-7B, Gemma-2-9B, Qwen2.5-7B, Llama-3.2-3B; four organisations, 3B–9B parameters), we show that hallucination, refusal, and certainty each produce a distinct dynamic signature in the residual stream that is detectable from the very first generated token. A logistic classifier trained on the first-token projection score achieves a leave-one-out AUC of 0.991 on Llama-3.1-8B and transfers zero-shot to TruthfulQA. Our geometric detector and an output-entropy baseline capture complementary failure modes: the subspace method flags factual-citation errors, while entropy flags physically improbable myths. All code and data are included for full reproducibility.
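
The evaluation recipe described in the abstract (a PCA projection score fed to a logistic classifier and scored by leave-one-out AUC) can be sketched roughly as follows. This is not the authors' pipeline: the vectors are synthetic stand-ins for first-token residual-stream activations, the subspace is assumed to be the single top principal component, and the dimensions, shift size, and helper names are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class = 16, 30  # hidden size and samples per class (illustrative)

# Synthetic stand-ins for first-token hidden states. Assumption: the
# "hallucinated" class is shifted along one hidden direction.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
grounded = rng.normal(size=(n_per_class, d))
hallucinated = rng.normal(size=(n_per_class, d)) + 6.0 * direction
X = np.vstack([grounded, hallucinated])
y = np.array([0] * n_per_class + [1] * n_per_class)

def top_component(H):
    """Return the mean and first principal direction of the rows of H."""
    mu = H.mean(axis=0)
    _, _, vt = np.linalg.svd(H - mu, full_matrices=False)
    return mu, vt[0]

def fit_logistic_1d(s, labels, lr=0.1, steps=500):
    """Fit p(y=1|s) = sigmoid(w*s + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * s + b)))
        w -= lr * np.mean((p - labels) * s)
        b -= lr * np.mean(p - labels)
    return w, b

def roc_auc(scores, labels):
    """AUC as the probability a positive outranks a negative (ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Leave-one-out: refit PCA and the classifier without sample i, then score it.
loo_probs = np.empty(len(y))
for i in range(len(y)):
    train = np.arange(len(y)) != i
    mu, pc = top_component(X[train])
    w, b = fit_logistic_1d((X[train] - mu) @ pc, y[train])
    s_test = (X[i] - mu) @ pc
    loo_probs[i] = 1.0 / (1.0 + np.exp(-(w * s_test + b)))

auc = roc_auc(loo_probs, y)
print(f"leave-one-out AUC on synthetic data: {auc:.3f}")
```

Refitting the principal direction inside each fold keeps the held-out sample from influencing the subspace it is scored against. The sign of a principal component is arbitrary, but the logistic fit absorbs it because only the predicted probability is collected.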
