Different observers view a scene through distinct shifts of visual attention, which can be represented as scanpaths of different varieties. Approaches that predict multiple visual scanpaths on an image must therefore produce scanpaths of distinct varieties to be human-like, a requirement that has been mostly overlooked in the existing literature. In this paper, we introduce ScanFormer, a framework that employs a meshed-memory transformer to predict diverse visual scanpaths of different varieties on an image. The memory-augmented encoder of the transformer generates multi-level contextual features that capture the relationships among image regions and embed learned biases towards them. The meshed decoder of the transformer models inter-fixation dependencies and predicts the fixations of the output scanpath one after another, taking as inputs the encoder features, the previous fixations, and a condition representing the variety of the scanpath. Generating multiple diverse scanpaths on an image is made possible by learning to embed scanpath variety as a representation of a scanpath's uniqueness among multiple scanpaths. We evaluate the proposed approach on three standard datasets using five types of established quantitative measures, and we also consider saccade amplitude and orientation density plots in the performance analysis. The experimental results show that ScanFormer outperforms state-of-the-art methods in generating multiple diverse human-like visual scanpaths on images. Further, an ablation study empirically establishes the significance of the various components of our framework. Code link: https://github.com/ashishverma03/ScanFormer
Verma et al. studied this question.
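The saccade amplitude and orientation statistics mentioned in the abstract can be illustrated with a minimal sketch. It assumes a scanpath is a list of (x, y) fixation coordinates in pixels, takes amplitude as the Euclidean distance between consecutive fixations, and measures orientation counter-clockwise from the positive x-axis with the image y-axis pointing down. The function name `saccade_stats` and these conventions are assumptions made for illustration, not the paper's evaluation code.

```python
import math


def saccade_stats(scanpath):
    """Return (amplitudes, orientations_deg) for consecutive fixation pairs.

    scanpath: list of (x, y) fixation coordinates in pixels (assumed format).
    """
    amplitudes, orientations = [], []
    for (x0, y0), (x1, y1) in zip(scanpath, scanpath[1:]):
        dx, dy = x1 - x0, y1 - y0
        # Amplitude: straight-line distance between two successive fixations.
        amplitudes.append(math.hypot(dx, dy))
        # Orientation: negate dy because image rows grow downward (assumed
        # convention), then wrap the angle into [0, 360).
        orientations.append(math.degrees(math.atan2(-dy, dx)) % 360.0)
    return amplitudes, orientations


# Hypothetical three-fixation scanpath: one rightward and one downward saccade.
example = [(100, 100), (400, 100), (400, 500)]
amps, oris = saccade_stats(example)
```

Pooling these values over all predicted or human scanpaths and histogramming them would give density plots of the kind the abstract refers to; the binning and normalization are left open here because the abstract does not specify them.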