What question did this study set out to answer?

This research asks whether signals read from a language model's log-probabilities can serve as a readable correlate of how the model resolves a social-moral judgment.

June 22, 2026 · Open Access

Reading "Friction" in LLM Logprobs: A Substrate-Level Correlate of Present Route-Competition in Social-Moral Judgments

Key Points

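The study ran three controlled experiments covering third-party moral commitment, binary rule-evaluation, and inequity reactions.
The analysis read token-level log-probability signals (the count of competing routes, the commit-margin in nats, the first-token entropy) from two language models at pilot scale.
Every contrast carries a bootstrap confidence interval, and a lexeme-invariance check separates robust effects from answer-token artefacts.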
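E1: Third-party moral commitment was morality-specific and lexeme-robust, with a friction difference appearing in the ambiguous region where the judgment is not pre-resolved.
E2: Content-invariant binary rule-evaluation generalised up an abstraction gradient, serving as a capability control that bounds the signal rather than as a social-moral signature.
E3: A disadvantaged agent's disengagement was driven by comparison and expectation rather than by reward amount, with a human-like advantaged/disadvantaged asymmetry.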

Abstract

Can a single token-level signal read from a language model’s log-probabilities — the count of competing routes, the commit-margin in nats, the first-token entropy — act as a readable correlate of how a model resolves a social-moral judgment? This paper reads such a signal across three experiments and delimits, carefully, what it does and does not establish: it is a substrate-level correlate of present route-competition, not an established causal mechanism and not a claim that the model feels anything. Three experiments, with controls. (E1) Third-party moral commitment is morality-specific and lexeme-robust, with a friction difference that appears in the ambiguous region where the judgment is not pre-resolved. (E2) Content-invariant binary rule-evaluation generalises up an abstraction gradient at capability — a capability control that bounds the signal, not a social-moral signature. (E3) A disadvantaged agent’s disengagement is driven by comparison and expectation rather than reward amount, with a human-like advantaged/disadvantaged asymmetry; and — read internally rather than at the output token — the inequity reaction is strongly predicted by a non-morally-defined internal disparity axis (a gap-controlled, context-general dose-response, made non-circular by construction), consistent with a fairness reaction that imports another agent’s unresolved competition rather than reciting a learned moral script. Every contrast carries a bootstrap confidence interval; a lexeme-invariance check distinguishes robust effects from answer-token artefacts; and the limits (two models, pilot scale, a correlational interpretability readout for the internal result) are stated rather than hidden. Companion papers in the series develop the underlying friction theory, the forward-modelling and operational accounts, the competing-routes measurement-model programme, and the mechanism home for the mirror-friction reading of fairness. 
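The three signals named in the abstract can be illustrated with a minimal sketch. The abstract does not define them operationally, so the definitions below are assumptions made for illustration only: the commit-margin is taken as the gap between the two highest first-token log-probabilities, the entropy is computed over the renormalised top-k candidates, and a "competing route" is any candidate whose renormalised probability clears a hypothetical threshold. The function name `friction_signals` and the parameter `route_threshold` are likewise hypothetical, not taken from the paper's code.

```python
import math


def friction_signals(top_logprobs, route_threshold=0.05):
    """Compute three illustrative first-token signals.

    top_logprobs: dict mapping candidate token -> natural-log probability
                  for the first answer token (e.g. a top-k logprob readout).
    route_threshold: ASSUMED cut-off for counting a candidate as a live route.
    """
    if len(top_logprobs) < 2:
        raise ValueError("need at least two candidate tokens")

    # Renormalise over the returned candidates, since a top-k readout
    # drops the tail of the distribution.
    log_z = math.log(sum(math.exp(lp) for lp in top_logprobs.values()))
    probs = {tok: math.exp(lp - log_z) for tok, lp in top_logprobs.items()}

    # ASSUMED definition: margin between the best and second-best candidate,
    # in nats because the inputs are natural-log probabilities.
    ranked = sorted(top_logprobs.values(), reverse=True)
    commit_margin = ranked[0] - ranked[1]

    # Shannon entropy in nats over the renormalised candidates.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0.0)

    # ASSUMED definition: number of candidates above the threshold.
    n_routes = sum(1 for p in probs.values() if p >= route_threshold)

    return {
        "commit_margin_nats": commit_margin,
        "first_token_entropy_nats": entropy,
        "competing_routes": n_routes,
    }


# Toy inputs (made up for illustration, not data from the study).
ambiguous = friction_signals({"yes": math.log(0.5), "no": math.log(0.5)})
resolved = friction_signals({"yes": math.log(0.98), "no": math.log(0.02)})
```

Under these assumed definitions, an evenly split judgment yields a zero margin and maximal two-way entropy, while a pre-resolved judgment yields a large margin and a single live route.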
Prepared for submission to Transactions on Machine Learning Research (TMLR).

Data and code: The stimulus generators, probes, re-analysis scripts, and per-token log-probability outputs are available from the author.
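The abstract states that every contrast carries a bootstrap confidence interval but does not say which bootstrap variant was used. As a sketch under that caveat, the following computes a percentile bootstrap interval for a difference in means between two conditions. The function name `bootstrap_mean_diff_ci`, the resample count, and the toy values are all assumptions for illustration and are not results from the study.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each condition is resampled with replacement independently,
    which assumes the two samples are unpaired.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()

    # Percentile interval: cut alpha/2 from each tail of the resampled differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(a) - statistics.fmean(b)
    return point, lower, upper


# Toy per-item entropy values in nats (made up for illustration only).
ambiguous_items = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0]
clear_items = [0.2, 0.1, 0.3, 0.2, 0.25, 0.15]
point, lower, upper = bootstrap_mean_diff_ci(ambiguous_items, clear_items)
```

A contrast would be read as reliable in this sketch when the interval excludes zero. A paired design would instead resample item indices jointly across the two conditions.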
