EViTIB: Efficient Vision Transformer via Inductive Bias Exploration for Image Super-Resolution

Abstract

Transformers have shown considerable promise in image super-resolution (SR) owing to their ability to establish long-range dependencies. However, vision transformers treat an image as a 1D token sequence and lack the inductive biases needed to model local visual patterns and scale invariance, both of which are essential for recovering local details. To address these limitations, we introduce EViTIB, a transformer-based image super-resolution network that integrates the inherent inductive biases of CNNs. EViTIB adopts a concurrent structure in which each transformer layer incorporates a convolution branch in parallel with the multi-head self-attention branch. The features from the two branches are then aggregated by a Hybrid Feature Coupling (HFC) module. As a result, EViTIB exploits locality inductive biases while retaining the capacity to capture global dependencies. Extensive experiments demonstrate that, with comparable parameter counts and FLOPs, EViTIB outperforms recent state-of-the-art SR methods.
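The concurrent structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the internals of the HFC module, so the fusion step here is a stand-in (channel concatenation followed by a linear projection), the attention is single-head rather than multi-head, and the convolution branch is assumed to be a 3x3 depthwise convolution. All function and parameter names (`parallel_block`, `w_fuse`, and so on) are hypothetical.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (N, C) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def depthwise_conv3x3(feat, kernels):
    """Zero-padded 3x3 depthwise convolution over an (H, W, C) feature map."""
    h, w, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feat)
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered independently by its own 3x3 kernel.
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out

def parallel_block(feat, params):
    """Attention and convolution branches run in parallel on the same input."""
    h, w, c = feat.shape
    tokens = feat.reshape(h * w, c)  # image as a 1D token sequence
    global_feat = self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    ).reshape(h, w, c)
    local_feat = depthwise_conv3x3(feat, params["dw"])
    # Assumed stand-in for HFC: concatenate both branches, project back to C.
    fused = np.concatenate([global_feat, local_feat], axis=-1) @ params["w_fuse"]
    return feat + fused  # residual connection

def init_params(c, rng):
    """Random weights for illustration only."""
    scale = 1.0 / np.sqrt(c)
    return {
        "w_q": rng.standard_normal((c, c)) * scale,
        "w_k": rng.standard_normal((c, c)) * scale,
        "w_v": rng.standard_normal((c, c)) * scale,
        "dw": rng.standard_normal((3, 3, c)) * scale,
        "w_fuse": rng.standard_normal((2 * c, c)) * scale,
    }

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 5, 8))  # toy (H, W, C) feature map
out = parallel_block(feat, init_params(8, rng))
```

In this sketch the attention branch mixes information across all H×W positions, while the convolution branch only sees a 3x3 neighbourhood, which is the locality bias the abstract refers to. The output has the same shape as the input, so such blocks could be stacked.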
