What question did this study set out to answer?

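The study aims to improve fine-grained visual classification by addressing two limitations of Vision Transformer (ViT)-based models: the progressive loss of gradient-based edge and texture signals during hierarchical token aggregation, and insufficient extraction of discriminative fine-grained features.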

February 2, 2026 · Open Access

Token Injection Transformer for Enhanced Fine-Grained Recognition

Key Points

Proposed a Gradient-Aware Token Injection Transformer framework.
Incorporated gradient magnitude and orientation into token embeddings.
Utilized multi-modal feature fusion to enhance model performance.
Conducted experiments on four standard FGVC benchmarks.
Achieved 92.9% top-1 accuracy on CUB-200-2011.
Reached 90.5% accuracy on iNaturalist 2018.
Obtained 93.2% accuracy on NABirds.
Achieved 95.3% accuracy on Stanford Cars.

Abstract

Fine-Grained Visual Classification (FGVC) involves distinguishing highly similar subordinate categories within the same basic-level class, presenting significant challenges due to subtle inter-class variations and substantial intra-class diversity. While Vision Transformer (ViT)-based approaches have demonstrated potential in this domain, they remain limited by two key issues: (1) the progressive loss of gradient-based edge and texture signals during hierarchical token aggregation and (2) insufficient extraction of discriminative fine-grained features. To overcome these limitations, we propose a Gradient-Aware Token Injection Transformer, a novel framework that explicitly incorporates gradient magnitude and orientation into token embeddings. This multi-modal feature fusion mechanism enhances the model’s capacity to preserve and leverage critical fine-grained visual cues. Extensive experiments on four standard FGVC benchmarks demonstrate the superiority of our approach, achieving 92.9% top-1 accuracy on CUB-200-2011, 90.5% on iNaturalist 2018, 93.2% on NABirds, and 95.3% on Stanford Cars, thereby validating its effectiveness and robustness.
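The abstract describes injecting gradient magnitude and orientation into token embeddings but gives no implementation details. The following is a minimal NumPy sketch of that general idea, not the authors' method. Everything in it is an assumption made for illustration: the Sobel operator as the gradient estimator, the cos/sin encoding of orientation, the additive fusion of the two projections, the random matrices standing in for learned weights, and the hypothetical function names `sobel_gradients`, `patchify`, and `inject_gradient_tokens`.

```python
import numpy as np


def sobel_gradients(img):
    """Return (magnitude, orientation) of a 2-D grayscale image.

    Assumption: Sobel filters are used; the paper does not specify the operator.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = img.shape
    # Replicate border pixels so the output keeps the input size.
    padded = np.pad(img.astype(np.float64), 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 cross-correlation one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy), np.arctan2(gy, gx)


def patchify(x, patch):
    """Split an (H, W) map into non-overlapping flattened patches."""
    h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    return (x.reshape(h // patch, patch, w // patch, patch)
             .transpose(0, 2, 1, 3)
             .reshape(-1, patch * patch))


def inject_gradient_tokens(img, patch, dim, rng):
    """Build patch tokens as pixel embedding + gradient embedding.

    Assumption: additive fusion with linear projections; the random
    matrices below are placeholders for weights a model would learn.
    """
    mag, ori = sobel_gradients(img)
    pix = patchify(img.astype(np.float64), patch)
    # Encode orientation as (cos, sin) to avoid the wrap-around at +/- pi.
    grad = np.concatenate(
        [patchify(mag, patch),
         patchify(np.cos(ori), patch),
         patchify(np.sin(ori), patch)],
        axis=1,
    )
    w_pix = rng.normal(scale=0.02, size=(pix.shape[1], dim))
    w_grad = rng.normal(scale=0.02, size=(grad.shape[1], dim))
    return pix @ w_pix + grad @ w_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))
    tokens = inject_gradient_tokens(image, patch=8, dim=64, rng=rng)
    print(tokens.shape)
```

In this sketch each token carries both appearance and edge information from the first layer onward, which is one plausible way to keep gradient cues from being diluted in later aggregation stages. A real model would train the projections end to end and might fuse the two streams differently.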

Cite This Study

Ma et al. (Fri,) studied this question.

synapsesocial.com/papers/6980fe8ac1c9540dea810a74 https://doi.org/10.3390/pr14030492
