Entity linking in visually rich documents aims to identify semantic relationships between entities (e.g., key–value pairs) by jointly leveraging textual, visual, and spatial information. Despite the success of pre-trained document models such as LayoutLMv3, two challenges remain for relation extraction: (1) spatial position signals injected only at the input embedding layer tend to decay in deeper transformer layers, weakening the model’s ability to capture layout-dependent entity associations; and (2) in long documents, softmax attention distributes weights across many irrelevant tokens, diluting the focus on informative regions. To address these issues, we propose Gated Spatial Attention (GSA), a lightweight, plug-in framework on top of LayoutLMv3 that comprises two complementary modules: Spatial Position Enhancement (SPE), which injects ALiBi-style linear biases into every attention layer with head groups specialized for reading-order, horizontal, vertical, and semantic proximity, and Gated Attention (GA), which applies a per-token scalar gate after the scaled dot-product attention to suppress outputs from irrelevant tokens. Experiments on FUNSD and CORD demonstrate that GSA consistently improves both semantic entity recognition and relation extraction, achieving state-of-the-art results with negligible computational overhead.
Wang et al. (Wed,) studied this question.
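To make the two modules described in the abstract concrete, the following is a minimal single-head sketch in plain Python, not the authors' implementation. It makes several assumptions that the text does not specify: queries, keys, and values are the raw token vectors (learned projections are omitted); the ALiBi-style bias is `-slope * distance`, with distance taken as the index gap for reading order and the gap between box centers for the horizontal and vertical groups; the semantic group receives no spatial bias; and the gate is a sigmoid of a linear function of each token's own vector. The names `pair_distance`, `gated_spatial_head`, `slope`, `gate_w`, and `gate_b` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def pair_distance(i, j, centers, mode):
    """Distance used for the linear bias (all definitions are assumptions)."""
    (xi, yi), (xj, yj) = centers[i], centers[j]
    if mode == "order":
        return abs(i - j)        # gap in reading order
    if mode == "horizontal":
        return abs(xi - xj)      # gap between box centers along x
    if mode == "vertical":
        return abs(yi - yj)      # gap between box centers along y
    return 0.0                   # "semantic": no spatial penalty (assumed)


def gated_spatial_head(x, centers, mode, slope, gate_w, gate_b):
    """One attention head with a linear spatial bias and a per-token gate.

    x       : list of n token vectors, used as queries, keys, and values
    centers : list of n (x, y) box centers
    """
    n, d = len(x), len(x[0])
    outputs = []
    for i in range(n):
        # Scaled dot-product score plus the ALiBi-style linear bias.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            - slope * pair_distance(i, j, centers, mode)
            for j in range(n)
        ]
        weights = softmax(scores)
        attended = [sum(weights[j] * x[j][c] for j in range(n)) for c in range(d)]
        # Per-token scalar gate applied after attention.
        z = sum(w * v for w, v in zip(gate_w, x[i])) + gate_b
        gate = 1.0 / (1.0 + math.exp(-z))
        outputs.append([gate * a for a in attended])
    return outputs


# Toy example: three tokens, two of which share the same x coordinate.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
boxes = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
print(gated_spatial_head(tokens, boxes, "horizontal", 0.5, [0.0, 0.0], 0.0))
```

In this sketch a larger `slope` concentrates each token's attention on neighbors along the chosen axis, while a strongly negative gate pre-activation drives that token's output toward zero, which is the suppression behavior the abstract attributes to GA.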