September 10, 2025 | Open Access

End‐to‐End Attention‐Enhanced Transformer for Image Captioning in Biomimetic Wearable Devices

Key Points

The proposed model improves image captioning accuracy by representing visual features from multiple regions more fully and by using both past and future semantic context during decoding.
On the MSCOCO dataset, the model outperforms the baseline by 2.1 percentage points on the CIDEr metric.
The encoder integrates multigranular visual features (grid and regional) by combining cross-attention and self-attention mechanisms.
Hardware evaluations on the wearable platform show real-time efficiency with a minimal memory footprint, outperforming state-of-the-art image captioning models in edge deployment scenarios.

Abstract

Existing transformer‐based image captioning methods face two primary limitations: first, they struggle to adequately represent visual features from multiple regions during the encoding phase, and second, the decoder fails to effectively utilize future semantic information during the inference phase. To address these challenges, an attention‐enhanced image captioning model is proposed. During the encoding phase, multigranular visual features are integrated by combining cross‐attention and self‐attention mechanisms, fully utilizing both grid and regional features. Additionally, a novel dense global self‐attention module is introduced to enhance model performance with minimal computational cost by fully leveraging the contextual information and fine‐grained details of the image. This model is particularly well‐suited for biomimetic wearable devices, where real‐time visual assistance plays a crucial role in enhancing the user experience. In the decoding phase, a bidirectional decoding structure with an adaptive masking module is designed to dynamically adjust the focus on past and future semantic information, enabling the model to combine historical and future context effectively for generating more accurate and relevant descriptions. Experimental results on the MSCOCO dataset show that the model outperforms the baseline, achieving a 2.1 percentage point improvement in the CIDEr metric. Comprehensive hardware evaluations on the wearable platform demonstrate real‐time efficiency with minimal memory footprint, significantly outperforming state‐of‐the‐art models in edge deployment scenarios.
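The encoder idea described in the abstract, combining self-attention and cross-attention over regional and grid features, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain scaled dot-product attention with no learned projections, and the function names (`attention`, `fuse_region_and_grid`) and the residual-sum fusion are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query produces a weighted average of `values`, with weights
    given by the softmax of its scaled dot products against `keys`.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs


def fuse_region_and_grid(region_feats, grid_feats):
    """Hypothetical fusion step (an assumption, not the paper's module).

    Regions attend to each other (self-attention) and to the grid
    features (cross-attention); both results are added to the input
    as a residual sum.
    """
    self_out = attention(region_feats, region_feats, region_feats)
    cross_out = attention(region_feats, grid_feats, grid_feats)
    return [
        [r + s + c for r, s, c in zip(rv, sv, cv)]
        for rv, sv, cv in zip(region_feats, self_out, cross_out)
    ]


# Toy inputs: 2 regional features and 3 grid features, each of dimension 4.
regions = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7]]
grid = [[0.9, 0.1, 0.4, 0.0], [0.2, 0.8, 0.1, 0.5], [0.5, 0.5, 0.5, 0.5]]
fused = fuse_region_and_grid(regions, grid)
```

The fused output keeps one vector per regional feature, so it could be passed on in place of the original regional features. A real encoder would add learned query, key, and value projections, multiple heads, and normalization layers.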
