The Transformer, with its global self-attention mechanism, has become a foundational architecture for natural language processing and general sequence modeling. However, the quadratic time and space complexity of standard self-attention poses significant computational and memory bottlenecks for long-sequence scenarios. At the same time, the parameter explosion caused by deep stacking limits deployability under resource-constrained conditions. Existing research typically alleviates these issues from two separate directions: one line of work reduces attention complexity through sparsification, low-rank approximation, or kernel methods; another line reduces parameter redundancy via cross-layer parameter sharing or recurrent updates. The problem is that these two technical routes are mostly independent, lacking a unified framework that simultaneously addresses computational efficiency, parameter efficiency, and deep representational power. The proposal of the Transformer and its subsequent efficient variants, including Reformer, Longformer, BigBird, Performer, Linformer, as well as parameter-sharing approaches like Universal Transformer and ALBERT, collectively form the direct background of this work.
Yizhou Huang (Sun,) studied this question.