What question did this study set out to answer?

The aim is to enhance the scalability and efficiency of Vision Transformers by introducing a linear complexity attention mechanism.

May 26, 2026 · Open Access

DS2 Attention: Dual-Stream Segmented Information Propagating Linear Attention for Vision Transformers

Key Points

Developed a dual-stream attention mechanism that propagates information bidirectionally across segments.
Implemented segment-level classification with summary tokens for improved predictions.
Conducted extensive experiments using the ImageNet-1K dataset to evaluate performance.
Achieved 0.3% higher accuracy on average on the ImageNet-1K dataset when substituted for standard attention in SOTA Vision Transformer designs.
Enhanced information flow and structural efficiency compared to standard attention mechanisms in Vision Transformers.

Abstract

While Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) results in visual recognition, their scalability remains fundamentally constrained by the quadratic complexity of global self-attention. To address this, we present a linear complexity attention design employing dual-stream information propagation to enhance representational efficiency and structured feature aggregation. Our proposed DS2 attention acts as a versatile replacement for standard attention in various SOTA designs, such as Tokens-to-Token (T2T) and FasterViT. In our design, half of the attention heads perform left-to-right segmented information propagation in a Perceiver-style manner, while the remaining half of the heads perform right-to-left propagation. This bidirectional structured attention enables efficient long-range dependency modeling without the overhead of full global attention. To improve classification performance, we introduce a segment-level classification strategy in which each segment is associated with a summary token. The final prediction is produced via cross-attention between image tokens and these summary tokens, enabling hierarchical semantic comprehension. Extensive experiments demonstrate that the proposed attention design achieves on average 0.3% higher accuracy on the ImageNet-1K dataset, while offering improved information flow and higher efficiency across SOTA Vision Transformer designs.
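The dual-stream propagation described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify the segment size, how state is carried between segments, or the learned projections, so everything below is an assumption made for illustration: the names `segmented_stream`, `dual_stream_attention`, `seg_len`, and `carry` are hypothetical, projections are omitted, and a single carried vector stands in for the Perceiver-style latent. It is not the authors' implementation, and it leaves out the summary-token classification head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention for one head: q is (m, d), k and v are (n, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def segmented_stream(x, seg_len, reverse=False):
    """One propagation stream for one head (assumed simplification).

    x has shape (n_tokens, d), with n_tokens divisible by seg_len.
    Each segment attends to its own tokens plus one vector carried over
    from the previously visited segment.
    """
    n, d = x.shape
    segments = x.reshape(n // seg_len, seg_len, d)
    n_seg = segments.shape[0]
    order = range(n_seg - 1, -1, -1) if reverse else range(n_seg)
    out = np.empty_like(segments)
    carry = None  # assumed form of the propagated state: a single (1, d) vector
    for i in order:
        seg = segments[i]
        ctx = seg if carry is None else np.concatenate([carry, seg], axis=0)
        out[i] = attend(seg, ctx, ctx)
        # Refresh the carried vector by reading from this segment's outputs.
        query = out[i].mean(axis=0, keepdims=True) if carry is None else carry
        carry = attend(query, out[i], out[i])
    return out.reshape(n, d)

def dual_stream_attention(x, n_heads, seg_len):
    """First half of the heads run left-to-right, second half right-to-left."""
    heads = np.split(x, n_heads, axis=-1)  # each head gets d // n_heads channels
    half = n_heads // 2
    outs = [segmented_stream(h, seg_len, reverse=(idx >= half))
            for idx, h in enumerate(heads)]
    return np.concatenate(outs, axis=-1)

# Usage: 16 tokens, 8 channels, 4 heads, segments of 4 tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
mixed = dual_stream_attention(tokens, n_heads=4, seg_len=4)
```

In this sketch each token attends to at most `seg_len + 1` keys, so the cost grows as O(n · seg_len), which is linear in the number of tokens for a fixed segment length, whereas full global attention grows as O(n²). Concatenating the two head groups gives every token access to context from both directions.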
