Benefiting from the effectiveness of self-attention mechanisms in the Transformer framework for modeling non-local image features, image super-resolution has made significant progress. We note that existing self-attention mechanisms usually use all similarities between the query and key tokens for feature aggregation. However, using all the similarities does not effectively facilitate high-quality image reconstruction, as not all query tokens are relevant to those in the keys. We further note that self-attention mechanisms are less effective at exploring local features, which hinders structural detail restoration. To overcome these problems, we develop a simple yet effective adaptive sparse self-attention method that uses the most useful token information for image restoration. We first develop a local spatially variant feature estimation method to build the queries and keys used in self-attention so that local information can be better modeled. Then, we present a sparse self-attention that adaptively selects the most useful similarity values from the self-attention matrix for better feature aggregation. Our analysis shows that the proposed method models both local and non-local features and thus facilitates better structural detail restoration. We further show that the proposed method can serve as an alternative to existing self-attention mechanisms for better image restoration. Experimental results show that the proposed method performs favorably against state-of-the-art methods on benchmark datasets in terms of accuracy and model complexity.
Pan et al. (Thu,) studied this question.
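To make the idea of selecting only the most useful similarity values concrete, the following is a minimal NumPy sketch of one possible sparsification rule. The abstract does not specify how the selection is performed, so the fixed top-k rule used here is an assumption swapped in for illustration, not the adaptive selection of the proposed method. The function name `topk_sparse_attention` and the parameter `keep` are hypothetical, and the local spatially variant feature estimation step is not modeled.

```python
import numpy as np


def topk_sparse_attention(q, k, v, keep):
    """Illustrative sparse attention (assumed top-k rule, not the paper's).

    q, k, v: arrays of shape (n_tokens, dim).
    keep:    number of similarity values retained per query token.
    Returns the aggregated features and the sparse attention weights.
    """
    n, d = q.shape
    # Dense similarity matrix between every query and every key.
    scores = q @ k.T / np.sqrt(d)
    # Per-row threshold: the keep-th largest similarity of each query.
    kth = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    # Discard similarities below the threshold before normalization.
    masked = np.where(scores >= kth, scores, -np.inf)
    # Numerically stable softmax over the retained similarities only.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage: 6 tokens with 4-dimensional features, keeping 2 similarities per query.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 4))
out, weights = topk_sparse_attention(q, k, v, keep=2)
```

With `keep` equal to the number of tokens, this sketch reduces to standard dense softmax attention, which is the sense in which a sparse variant can act as a drop-in alternative to an existing self-attention layer.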