ABSTRACT Single image super‐resolution (SISR) has progressed rapidly with transformer‐based approaches, which model long‐range dependencies effectively and achieve state‐of‐the‐art performance. However, their substantial computational complexity and heavy resource demands hinder deployment on resource‐constrained devices and in broader real‐world applications. To address these limitations, this paper proposes a multiscale mixed transformer (MMT) that significantly improves efficiency while maintaining high reconstruction accuracy. The core architecture consists of three novel components: a high‐frequency preserving block (HFPB) that downsamples feature maps while preserving fine‐grained details, a mixed transformer block (MTB) that efficiently integrates global and local feature information, and a large‐kernel attention tail (LKAT) for enhanced global context modeling. Within the MTB, parameter‐free pixel mixer (PM) layers based on pixel‐shift operations replace part of the self‐attention (SA) mechanism to strengthen spatial detail modeling without increasing computational cost; striped window self‐attention (SWSA) exploits image anisotropy to capture long‐range dependencies efficiently; and multiscale spatial attention (MSA) fuses multiscale features. Extensive experiments on five benchmark datasets demonstrate that MMT achieves superior performance at scaling factors of ×2, ×3, and ×4, surpassing the second‐best method on the Manga109 dataset by 0.09 dB, 0.17 dB, and 0.09 dB, respectively, in peak signal‐to‐noise ratio (PSNR), while producing visually sharper edges and richer textures. The proposed MMT thus offers a promising direction for efficient and effective SISR.
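To make two of the ideas above concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function `pixel_shift_mix` illustrates what a parameter‐free pixel‐shift mixer could look like; the five‐way channel split, the one‐pixel shift distance, and the circular wrap‐around are assumptions for illustration, since the abstract does not specify them. The function `psnr` implements the standard PSNR definition; the peak value of 255 is likewise an assumed default, and both function names are hypothetical.

```python
import numpy as np


def pixel_shift_mix(x: np.ndarray) -> np.ndarray:
    """Parameter-free pixel mixing on a (C, H, W) feature map.

    Assumption (not stated in the abstract): channels are split into five
    groups; four groups are shifted by one pixel (left, right, up, down)
    with circular wrap-around, and the fifth group is left unchanged.
    """
    if x.ndim != 3:
        raise ValueError("expected a (C, H, W) array")
    out = x.copy()
    groups = np.array_split(np.arange(x.shape[0]), 5)
    # (step, axis): axis 2 is width, axis 1 is height.
    shifts = [(-1, 2), (1, 2), (-1, 1), (1, 1)]
    # zip stops after four groups, so the fifth keeps its original values.
    for idx, (step, axis) in zip(groups, shifts):
        if idx.size:
            out[idx] = np.roll(x[idx], step, axis=axis)
    return out


def psnr(reference, estimate, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


if __name__ == "__main__":
    features = np.random.rand(10, 8, 8)
    mixed = pixel_shift_mix(features)
    print(mixed.shape)  # same shape as the input; no learned parameters
    print(psnr(np.zeros((4, 4)), np.full((4, 4), 25.5)))
```

Because the shift only rearranges existing values, it adds no weights and no multiply‐accumulate operations, which is the sense in which such a layer is "parameter‐free". For scale, under the PSNR definition above a gain of 0.09 dB corresponds to roughly a 2% reduction in mean squared error, since 10^(-0.009) ≈ 0.98.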