What question did this study set out to answer?

The study asks whether inserting a position-agnostic pre-projection layer ahead of the Q/K/V projections improves transformer attention.

April 12, 2026 · Open Access

Position-Agnostic Pre-Projection for Transformer Attention: Nonlinear Feature Construction and Content Skip Before Q/K/V

Key Points

Inserted a nonlinear MLP between layer norm and Q/K/V projections in transformer attention blocks.
Conducted frozen-probe experiments on Pythia-160M and Pythia-410M.
Trained only the pre-projection parameters while keeping the base model frozen.
At 160M scale, pre-projection outperformed LoRA on LAMBADA (0.154 vs. 0.126) despite using fewer parameters.
Combining pre-projection with a small LoRA achieved the best results overall, matching a 10× larger standalone LoRA on perplexity while preserving comprehension gains.
Pre-projection adds no K/V cache overhead.

Abstract

We propose a simple architectural modification to transformer attention blocks: inserting a small nonlinear MLP between the layer norm and Q/K/V projections. This pre-projection operates in a position-agnostic manner: it constructs richer feature representations from token content alone, before any positional encoding (e.g., RoPE) is applied. In frozen-probe experiments on Pythia-160M and Pythia-410M, training only the pre-projection parameters while keeping the base model frozen yields consistent improvements across all benchmarks. At 160M scale, the pre-projection outperforms LoRA on LAMBADA (0.154 vs. 0.126) despite using fewer parameters. Combining pre-projection with a small LoRA achieves the best results overall, matching a 10× larger standalone LoRA on perplexity while preserving comprehension gains. The pre-projection adds no K/V cache overhead, and its position-agnostic design makes it naturally suited to multimodal architectures where different modalities have fundamentally different notions of position.
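The placement described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the class name `PreProjection`, the residual form of the content skip (`h + W_down(gelu(W_up h))`), the GELU nonlinearity, the toy dimensions, and the random weights are all assumptions made for illustration. The sketch shows a per-token MLP applied to the layer-norm output before Q/K/V, and checks that it never looks at token position and leaves the K/V shapes unchanged.

```python
import math
import random


def gelu(v):
    # tanh approximation of GELU (the choice of nonlinearity is an assumption)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def matvec(w, x):
    # w is a list of rows (out_dim x in_dim); x is a list of length in_dim
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class PreProjection:
    """Hypothetical per-token MLP with a content skip: h + W_down(gelu(W_up h))."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_up = rand_matrix(d_hidden, d_model, rng)
        self.w_down = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, tokens):
        out = []
        for h in tokens:  # each token is handled on its own; no position index is used
            delta = matvec(self.w_down, [gelu(v) for v in matvec(self.w_up, h)])
            out.append([a + b for a, b in zip(h, delta)])  # content skip (assumed residual)
        return out


def project_qkv(tokens, w_q, w_k, w_v):
    # Q/K/V projections are unchanged; they simply read the pre-projected features
    return ([matvec(w_q, h) for h in tokens],
            [matvec(w_k, h) for h in tokens],
            [matvec(w_v, h) for h in tokens])


rng = random.Random(0)
d_model, d_hidden, seq_len = 8, 4, 5  # toy sizes, not taken from the paper

# Stand-in for the layer-norm output of one attention block
normed = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(seq_len)]

pre = PreProjection(d_model, d_hidden, rng)
w_q, w_k, w_v = (rand_matrix(d_model, d_model, rng) for _ in range(3))

features = pre(normed)
q, k, v = project_qkv(features, w_q, w_k, w_v)
# A positional encoding such as RoPE would be applied to q and k only after this point.

# Reversing the token order simply reverses the pre-projected features
reversed_features = pre(normed[::-1])
print(reversed_features == features[::-1])
print(len(k), len(k[0]))
```

Because the pre-projection output has the same width as its input, the keys and values keep their original shape, which is consistent with the claim that the K/V cache does not grow. In a frozen-probe setup as described above, only the two pre-projection matrices would be trained and the Q/K/V weights would stay fixed.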


Cite This Study

Shinde Chirag (Fri,) studied this question.

synapsesocial.com/papers/69db37f94fe01fead37c623d https://doi.org/10.5281/zenodo.19498159
