What question did this study set out to answer?

The aim is to develop a framework for predicting diverse visual scanpaths in a way that mimics human attention shifts.

May 25, 2026

ScanFormer: Transformer-based Prediction of Multiple Visual Scanpaths of Different Varieties

Key Points

The authors introduce ScanFormer, a framework that uses a meshed-memory transformer for scanpath prediction.
A memory-augmented encoder generates multi-level contextual features, and a meshed decoder successively predicts the fixations of the output scanpath.
The approach is evaluated on three standard datasets using five types of established quantitative measures, along with saccade amplitude and orientation density plots.
ScanFormer outperformed state-of-the-art methods in generating multiple diverse, human-like visual scanpaths on images.
Scanpath variety is embedded as a learned representation related to a scanpath's uniqueness among multiple scanpaths, which is what enables generating several distinct scanpaths per image.
An ablation study empirically established the significance of the various components of the ScanFormer framework.
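
The saccade amplitude and orientation statistics mentioned above can be derived directly from a scanpath's fixation coordinates. The following is a minimal sketch, assuming a scanpath is an ordered list of (x, y) fixation points in pixels. The function name `saccade_stats` and the coordinate convention are illustrative assumptions and are not taken from the ScanFormer code.

```python
import math

def saccade_stats(scanpath):
    """Return (amplitudes, orientations) for consecutive fixation pairs.

    scanpath: ordered list of (x, y) fixation coordinates (assumed pixels).
    Amplitude is the Euclidean length of each saccade.
    Orientation is the saccade direction in degrees, in [0, 360).
    """
    amplitudes, orientations = [], []
    # Pair each fixation with its successor; n fixations give n - 1 saccades.
    for (x0, y0), (x1, y1) in zip(scanpath, scanpath[1:]):
        dx, dy = x1 - x0, y1 - y0
        amplitudes.append(math.hypot(dx, dy))
        # atan2 returns (-180, 180] degrees; wrap into [0, 360).
        orientations.append(math.degrees(math.atan2(dy, dx)) % 360.0)
    return amplitudes, orientations

amplitudes, orientations = saccade_stats([(0, 0), (3, 4), (3, 0)])
```

In image coordinates the y axis usually points downward, so the angles here increase clockwise on screen. Density plots of these two lists over many scanpaths give the kind of amplitude and orientation distributions used to compare predicted and human scanpaths.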

Abstract

Different humans perceive a scene through distinct visual attention shifts that can be represented by scanpaths of different varieties. Approaches that predict multiple visual scanpaths on an image must thus consider producing scanpaths of distinct varieties for human-like generation, which has been mostly overlooked in the existing literature. In this paper, we introduce ScanFormer, a framework to predict diverse visual scanpaths of different varieties on an image employing a meshed-memory transformer. The memory-augmented encoder of the transformer generates multi-level contextual features that capture the relationships among image regions and embed learned biases towards them. The meshed decoder of the transformer models inter-fixation dependencies to successively predict the fixations of the output scanpath, by taking the features from the encoder, previous fixations and a condition representing the variety of the scanpath as the inputs. The generation of multiple diverse visual scanpaths on an image is facilitated by learning to embed scanpath variety as a representation related to a scanpath's uniqueness among multiple scanpaths. We evaluate the proposed approach on three standard datasets in terms of five types of established quantitative measures. Saccade amplitude and orientation density plots are also considered in the performance analysis. The experimental results demonstrate the superiority of ScanFormer over state-of-the-art methods in generating multiple diverse human-like visual scanpaths on images. Further, an ablation study is provided to empirically establish the significance of the various components of our framework. Code link: https://github.com/ashishverma03/ScanFormer
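
The decoding procedure described in the abstract is autoregressive: each fixation is predicted from the encoder features, the previously predicted fixations, and a condition representing scanpath variety. The sketch below illustrates only that control flow. Its scoring function is a toy placeholder, not the meshed-memory transformer, and every name in it (`toy_decoder_step`, `generate_scanpath`, the `(x, y, salience)` region format, the scalar `variety`) is an assumption made for illustration.

```python
import random

def toy_decoder_step(regions, previous, variety, rng):
    """Hypothetical stand-in for a trained decoder (NOT the ScanFormer model).

    regions:  list of (x, y, salience) tuples standing in for encoder features.
    previous: fixations predicted so far, as (x, y) tuples.
    variety:  scalar condition; larger values allow more random exploration.
    """
    def score(region):
        x, y, salience = region
        if (x, y) in previous:
            return float("-inf")  # discourage revisiting a region
        return salience + variety * rng.random()

    x, y, _ = max(regions, key=score)
    return (x, y)

def generate_scanpath(regions, variety, length, seed=0):
    """Predict fixations one at a time, feeding earlier outputs back in."""
    rng = random.Random(seed)
    scanpath = []
    for _ in range(length):
        scanpath.append(toy_decoder_step(regions, scanpath, variety, rng))
    return scanpath

regions = [(10, 10, 0.9), (50, 20, 0.5), (30, 40, 0.7)]
# One scanpath per variety condition, all on the same image features.
scanpaths = [generate_scanpath(regions, v, length=3) for v in (0.0, 1.0, 2.0)]
```

With `variety` set to 0 the toy decoder simply visits regions in order of decreasing salience. Larger values let the random term reorder the visits, which loosely mirrors how a single conditioning input can yield several distinct scanpaths for one image.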

Cite This Study

Verma et al. studied this question.

synapsesocial.com/papers/6a13e8030e02ee3982d32af1 https://doi.org/10.1145/3817047
