The synergy between foveal and peripheral processing is fundamental to the efficiency of biological vision. While hybrid Convolutional Neural Network (CNN)-Transformer architectures aim to capture both local and global features, they often rely on static, predefined structures that struggle to dynamically align information and adaptively allocate computational resources, ultimately limiting their performance. To address this limitation, we introduce the Central-Peripheral Vision Transformer (CPVT), a novel architecture that explicitly and hierarchically mimics this biological dichotomy. CPVT employs fine-grained, convolutionally modulated attention in its shallow layers to emulate foveal vision, while seamlessly transitioning to a coarse-grained, global attention mechanism in deeper layers to emulate peripheral vision. This design is enhanced by two specialized Feed-Forward Networks that facilitate synergistic information interaction. Rigorously validated on diverse medical imaging benchmarks, CPVT achieves state-of-the-art performance, attaining classification accuracies of 87.98% on the International Skin Imaging Collaboration (ISIC) 2018 challenge dataset and 90.41% on the Kvasir dataset. These results demonstrate that an adaptive, hierarchical integration of biological vision principles can significantly enhance machine perception for medical image analysis.

• Propose a novel Central-Peripheral Vision Transformer (CPVT) for medical imaging.
• New attention modules mimic biological vision for adaptive feature fusion.
• Specialized FFNs enhance hierarchical processing of local and global features.
• Achieves state-of-the-art performance and robust generalization on diverse medical datasets, including ISIC 2018 and Kvasir.
Rui et al. (Fri,) studied this question.