What question did this study set out to answer?

The aim is to enhance single image super-resolution (SISR) by improving efficiency without sacrificing reconstruction accuracy.

March 13, 2026

Multiscale Mixed Transformer for Single Image Super‐Resolution

Key Points

Developed a multiscale mixed transformer (MMT) to reduce computational complexity.
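Introduced a high-frequency preserving block (HFPB) that downsamples feature maps while preserving fine-grained details.
Implemented a mixed transformer block (MTB) in which parameter-free pixel mixer (PM) layers replace part of the self-attention to model spatial detail at no extra computational cost.
Utilized striped window self-attention (SWSA), which exploits image anisotropy, to capture long-range dependencies efficiently.
Employed multiscale spatial attention (MSA) to fuse multiscale features.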
MMT outperformed the compared methods on five benchmark datasets across three scaling factors.
On Manga109, improved peak signal-to-noise ratio (PSNR) over the second-best method by 0.09 dB, 0.17 dB, and 0.09 dB at the three scaling factors.
Produced visually sharper edges and richer textures in the reconstructed images.
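
To put the reported PSNR gains in perspective, the sketch below computes PSNR from the mean squared error (MSE) and converts a gain in dB into the corresponding MSE ratio. It is an illustration only: the function names `psnr` and `mse_ratio_for_gain` are hypothetical, images are assumed to be 8-bit grayscale values stored as nested lists, and the paper's actual evaluation protocol (for example, which color channel is scored or whether borders are cropped) is not stated here.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Return PSNR in dB between two equally sized grayscale images.

    Both images are nested lists (rows of pixel values). This layout and the
    8-bit default for max_value are assumptions made for illustration.
    """
    if len(reference) != len(reconstructed):
        raise ValueError("images must have the same number of rows")
    squared_error = 0.0
    count = 0
    for ref_row, rec_row in zip(reference, reconstructed):
        if len(ref_row) != len(rec_row):
            raise ValueError("rows must have the same length")
        for ref_pixel, rec_pixel in zip(ref_row, rec_row):
            squared_error += (ref_pixel - rec_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Return the factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10 ** (-gain_db / 10)
```

Because PSNR is logarithmic in the MSE, a gain of 0.09 dB to 0.17 dB corresponds to roughly 2% to 4% lower mean squared error (`mse_ratio_for_gain(0.17)` is about 0.96).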

Abstract

Single image super‐resolution (SISR) has witnessed remarkable progress with transformer‐based approaches, which effectively model long‐range dependencies and achieve state‐of‐the‐art performance. However, their substantial computational complexity and heavy resource demands severely hinder deployment on resource‐constrained devices and broader real‐world applications. To address these critical limitations, this paper proposes a multiscale mixed transformer (MMT) that significantly improves efficiency while maintaining high reconstruction accuracy. The core architecture consists of three novel components: a high‐frequency preserving block (HFPB) that downsamples feature maps while preserving fine‐grained details, a mixed transformer block (MTB) that efficiently integrates global and local feature information, and a large‐kernel attention tail (LKAT) for enhanced global context modeling. Within the MTB, parameter‐free pixel mixer (PM) layers with pixel‐shift operations replace part of the self‐attention (SA) mechanism to strengthen spatial detail modeling without increasing computational cost, while striped window self‐attention (SWSA) exploits image anisotropy for efficient long‐range dependency capture, and multiscale spatial attention (MSA) effectively fuses multiscale features. Extensive experiments on five benchmark datasets demonstrate that MMT achieves superior performance across three scaling factors, surpassing the second‐best method on the Manga109 dataset by 0.09 dB, 0.17 dB, and 0.09 dB in terms of peak signal‐to‐noise ratio (PSNR), respectively, while producing visually sharper edges and richer textures. The proposed MMT provides a promising direction for efficient and effective SISR.
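
The abstract describes the pixel mixer only as a parameter‐free layer built on pixel‐shift operations. The sketch below shows one common way such a shift‐based mixer can be realized; it is not taken from the paper. The names `shift_channel`, `pixel_mixer`, and `SHIFTS` are hypothetical, and three details are assumptions: channels are assigned to five groups round‐robin, four groups are shifted by one pixel in different directions while the fifth is left unchanged, and shifts wrap around circularly at the image border.

```python
def shift_channel(channel, dy, dx):
    """Circularly shift a 2-D channel (list of rows) down by dy and right by dx."""
    height = len(channel)
    width = len(channel[0])
    return [
        [channel[(y - dy) % height][(x - dx) % width] for x in range(width)]
        for y in range(height)
    ]


# Assumed shift pattern (dy, dx): right, left, down, up, and no shift.
SHIFTS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]


def pixel_mixer(feature_map):
    """Apply a fixed per-channel shift to a feature map.

    feature_map is a list of channels, each a list of rows. No weights are
    involved: values are only moved, so the layer adds no parameters.
    """
    return [
        shift_channel(channel, *SHIFTS[index % len(SHIFTS)])
        for index, channel in enumerate(feature_map)
    ]


if __name__ == "__main__":
    channel = [[1, 2, 3], [4, 5, 6]]
    for mixed in pixel_mixer([channel] * 5):
        print(mixed)
```

After such a shift, each spatial position holds values that originally came from its neighbors in different channels, so a subsequent pointwise (per‐pixel) projection can combine local spatial context without any attention computation.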
