Autoregressive transformer language models apply causal attention masking during both generation and prompt processing (prefill). During generation, causal masking is architecturally necessary: each new token is conditioned on the tokens already selected, and later tokens do not yet exist. During prefill, however, the entire input sequence already exists. Causal masking during prefill is therefore not a necessity but a choice, one inherited from the generation architecture without systematic examination of its consequences. This paper argues that the consequences are substantial and linguistically asymmetric. Languages that place semantically critical information late in the clause (head-final languages including German, Japanese, Korean, Turkish, Hindi, and Latin) are structurally disadvantaged by causal prefill masking, because early-position tokens cannot attend to late-arriving semantic anchors. The model's representation of a German subordinate clause's subject, computed at position 3, is permanently impoverished: at no layer can it attend to the verb at position 15, even though that verb is already present in the input and known to the system. The paper characterises the problem, traces its origins to a path dependency in architectural design, proposes three interventions (bidirectional prefill, pragmatic frame fronting, and two-pass processing), and outlines empirical tests based on activation-level observation. The core claim is that the causal mask during prefill is an engineering convenience mistaken for an architectural requirement, with measurable costs for a majority of the world's languages.
Storm Bjørn Temte (Tue,) studied this question.
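The masking asymmetry described in the abstract can be illustrated with a toy attention computation. The following is a minimal sketch, not the paper's implementation: the helper names (`build_mask`, `attention_weights`), the 16-token sequence length, the 0-indexed positions, and the uniform attention scores are all assumptions chosen only to make the subject-at-position-3, verb-at-position-15 example concrete.

```python
import math

def build_mask(seq_len, causal):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def attention_weights(scores, mask):
    """Row-wise softmax over allowed positions; masked positions get weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        peak = max(s for s, m in zip(row_scores, row_mask) if m)
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy setup: 16 tokens, subject at index 3, clause-final verb at index 15.
SEQ_LEN = 16
SUBJECT_POS = 3
VERB_POS = 15

# Assumption: uniform raw scores, so any difference comes from the mask alone.
scores = [[0.0] * SEQ_LEN for _ in range(SEQ_LEN)]

causal_weights = attention_weights(scores, build_mask(SEQ_LEN, causal=True))
bidir_weights = attention_weights(scores, build_mask(SEQ_LEN, causal=False))

print("causal,        subject -> verb:", causal_weights[SUBJECT_POS][VERB_POS])
print("bidirectional, subject -> verb:", bidir_weights[SUBJECT_POS][VERB_POS])
print("causal,        verb -> subject:", causal_weights[VERB_POS][SUBJECT_POS])
```

Under the causal mask the subject position assigns no weight to the verb, while the verb position can still attend back to the subject; with the mask lifted, the subject position can draw on the verb as well. A single-layer toy like this shows only the connectivity pattern and says nothing about the size of any effect in a trained model.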