Traditional Vision Transformer (ViT) models have limitations in handling local details and long-range dependencies, and they are prone to overfitting when training data are scarce. In medical images, structures such as retinal blood vessels and lesion boundaries exhibit distinct fractal properties. This paper proposes the Fractional Convolution Vision Transformer (FCViT), which compensates for ViT's weakness in local feature capture by fusing convolutional information. The model further strengthens medical image classification by analyzing frequency-domain features with the fractional-order Fourier transform and by capturing global information through a self-attention mechanism. This three-branch architecture lets the model view the data from multiple perspectives, capturing both local details and global context, which in turn improves classification performance and generalization. In experiments, FCViT achieved 93.52% accuracy, 93.32% precision, 92.79% recall, and a 93.04% F1-score on the standardized fundus glaucoma dataset. On the Harvard Dataverse-V1 dataset, it reached 94.21% accuracy, 93.73% precision, 93.67% recall, and a 93.68% F1-score. FCViT achieves significant performance gains across a variety of neural network architectures and tasks with different source datasets, demonstrating its effectiveness and utility in deep learning.
Sun et al. (Thu,) studied this question.
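The abstract does not specify which discrete fractional-order Fourier transform the frequency branch uses. As an illustration only, the sketch below implements one simple, well-known variant, the weighted-type fractional Fourier transform, which builds a fractional power of the unitary DFT from its first four integer powers (using the fact that applying the unitary DFT four times returns the input). The function names `frft_weighted`, `frft2_weighted`, and `frequency_features`, the default order of 0.5, and the log-magnitude feature map are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def frft_weighted(x, order, axis=-1):
    """Weighted-type fractional Fourier transform along one axis.

    Assumption: this is one common discrete definition, not necessarily
    the one used in FCViT. It relies on F^4 = I for the unitary DFT F,
    so that F^a = sum_k c_k(a) * F^k for k = 0..3.
    """
    x = np.asarray(x, dtype=complex)

    # Integer powers of the unitary DFT applied to x: F^0 x, F^1 x, F^2 x, F^3 x.
    powers = [x]
    for _ in range(3):
        powers.append(np.fft.fft(powers[-1], axis=axis, norm="ortho"))

    # Eigenvalues of F are exp(-i*pi*m/2), m = 0..3. Raising them to the
    # fractional order and projecting back gives the mixing weights c_k.
    m = np.arange(4)
    out = np.zeros_like(x)
    for k in range(4):
        c_k = np.exp(-0.5j * np.pi * m * (order - k)).mean()
        out = out + c_k * powers[k]
    return out


def frft2_weighted(image, order):
    """Separable 2-D transform: apply the 1-D transform to rows and columns."""
    return frft_weighted(frft_weighted(image, order, axis=0), order, axis=1)


def frequency_features(image, order=0.5):
    """Hypothetical feature map: log-magnitude of the fractional spectrum."""
    return np.log1p(np.abs(frft2_weighted(image, order)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.standard_normal((8, 8))  # stand-in for an image patch
    feats = frequency_features(patch, order=0.5)
    print(feats.shape)
```

With order 0 the transform returns the input unchanged, and with order 1 it reduces to the ordinary unitary 2-D DFT; intermediate orders interpolate between the spatial and frequency domains. In a three-branch model such a feature map would sit alongside the convolutional and self-attention branches, but how FCViT fuses the three is not described in the abstract.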