We propose a simple architectural modification to transformer attention blocks: inserting a small nonlinear MLP between the layer norm and the Q/K/V projections. This pre-projection is position-agnostic: it constructs richer feature representations from token content alone, before any positional encoding (e.g., RoPE) is applied. In frozen-probe experiments on Pythia-160M and Pythia-410M, training only the pre-projection parameters while keeping the base model frozen yields consistent improvements across all benchmarks. At the 160M scale, the pre-projection outperforms LoRA on LAMBADA (0.154 vs. 0.126) despite using fewer parameters. Combining the pre-projection with a small LoRA achieves the best results overall, matching a 10× larger standalone LoRA on perplexity while preserving the comprehension gains. The pre-projection adds no K/V cache overhead, and its position-agnostic design makes it naturally suited to multimodal architectures in which different modalities have fundamentally different notions of position.
Shinde Chirag (Fri,) studied this question.
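
As a rough illustration of the pre-projection described above, the following is a minimal sketch in plain Python rather than the implementation used in the experiments. It applies a small per-token MLP after the layer norm and before the Q/K/V projections. The residual connection, the GELU nonlinearity, the hidden width, and every name here (`PreProjection`, `attention_inputs`, `D_MODEL`, `D_HIDDEN`) are assumptions made for illustration; the abstract specifies only "a small nonlinear MLP". RoPE is omitted, since it would be applied to the queries and keys afterward.

```python
import math
import random


def gelu(v):
    # Tanh approximation of GELU (choice of nonlinearity is an assumption).
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def layer_norm(x, eps=1e-5):
    # Layer norm without learned scale/shift, to keep the sketch short.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class PreProjection:
    """Hypothetical per-token MLP placed between layer norm and Q/K/V."""

    def __init__(self, d_model, d_hidden, rng):
        self.W1 = rand_matrix(d_hidden, d_model, rng)
        self.W2 = rand_matrix(d_model, d_hidden, rng)

    def __call__(self, token):
        hidden = [gelu(v) for v in matvec(self.W1, token)]
        update = matvec(self.W2, hidden)
        # Residual connection: an assumption, not stated in the abstract.
        return [t + u for t, u in zip(token, update)]

    def num_params(self):
        return len(self.W1) * len(self.W1[0]) + len(self.W2) * len(self.W2[0])


def attention_inputs(tokens, pre, Wq, Wk, Wv):
    # Each token is processed on its own: no position index is ever used.
    feats = [pre(layer_norm(tok)) for tok in tokens]
    q = [matvec(Wq, f) for f in feats]
    k = [matvec(Wk, f) for f in feats]
    v = [matvec(Wv, f) for f in feats]
    return q, k, v  # RoPE would be applied to q and k after this point


D_MODEL, D_HIDDEN, SEQ_LEN = 8, 4, 5
rng = random.Random(0)
pre = PreProjection(D_MODEL, D_HIDDEN, rng)
Wq, Wk, Wv = (rand_matrix(D_MODEL, D_MODEL, rng) for _ in range(3))
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)] for _ in range(SEQ_LEN)]

q, k, v = attention_inputs(tokens, pre, Wq, Wk, Wv)

# Position-agnostic check: reversing the token order reverses the outputs.
q_rev, k_rev, v_rev = attention_inputs(tokens[::-1], pre, Wq, Wk, Wv)
is_position_agnostic = (q_rev == q[::-1]) and (k_rev == k[::-1]) and (v_rev == v[::-1])
```

In a frozen-probe setting like the one described, only `pre.W1` and `pre.W2` would be trained while `Wq`, `Wk`, and `Wv` stay fixed. Because the sketch leaves the shapes of the keys and values unchanged, it is consistent with the claim that no K/V cache overhead is added.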