What question did this study set out to answer?

This research aims to improve image captioning quality by utilizing a novel double-attention mechanism.

April 22, 2026 | Open Access

Double‐Attention Transformer for Cross‐Modal Image Captioning: Enhancing Visual–Linguistic Alignment on Low‐Resource Datasets

Key Points

Developed a double-attention transformer (DAT) model combining self-attention and cross-attention mechanisms.
Evaluated the model using the Flickr8k dataset with publicly available pretrained encoders.
Compared performance metrics with conventional encoder-decoder models.
The DAT achieved BLEU-4 of 25.6, METEOR of 22.3, CIDEr of 79.2, and SPICE of 15.8.
Outperformed baselines with precision of 0.83, recall of 0.78, and F1 score of 0.80.
Demonstrated effectiveness in fine-grained semantic representation, especially in low-resource settings.
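
The reported precision, recall, and F1 values can be checked against each other. The sketch below assumes F1 is the standard harmonic mean of precision and recall; the summary does not say how precision and recall were computed for captions, so the function name `f1_score` and this definition are illustrative assumptions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard definition)."""
    return 2 * precision * recall / (precision + recall)

# Values as reported in the key points.
f1 = f1_score(0.83, 0.78)
print(round(f1, 2))
```

Under that assumption, the computed value rounds to the reported F1 score of 0.80.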

Abstract

Generating accurate and contextually relevant captions for images remains a core challenge in vision‐language understanding. To address this, we propose the double‐attention transformer (DAT), an image captioning model that integrates self‐attention and cross‐attention mechanisms to enhance intramodal feature learning and intermodal semantic alignment. Unlike conventional encoder–decoder models, the DAT enables richer contextual blending between visual and textual modalities, resulting in more precise and semantically coherent captions. Evaluated under a Flickr8k fine‐tuning regime using standard, publicly available pretrained encoder initializations, the DAT achieves a BLEU‐4 score of 25.6, a METEOR score of 22.3, a CIDEr score of 79.2, and a SPICE score of 15.8. It also consistently outperforms the baseline on precision (0.83), recall (0.78), and F1 score (0.80). Within this setting, the contribution is architectural, and the results indicate competitive performance among methods evaluated under the same fine‐tuning conditions. These results demonstrate the effectiveness of the double‐attention mechanism, particularly for fine‐grained semantic representation and caption generation under low‐resource conditions. The DAT offers a lightweight yet robust framework suitable for real‐time image captioning applications where computational resources are constrained.
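
The combination of self‐attention (within the text) and cross‐attention (from text to image features) described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the abstract gives no layer details, so the block below assumes a single attention head, no learned projections, no layer normalization, and no causal masking. The names `attention` and `double_attention_block` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def double_attention_block(text, image):
    # Self-attention: text tokens attend to each other (intramodal).
    h = add(text, attention(text, text, text))
    # Cross-attention: text queries attend to image keys/values (intermodal).
    return add(h, attention(h, image, image))

# Toy inputs (assumed): three text-token vectors and two image-region vectors.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.5], [1.0, -1.0]]
out = double_attention_block(text, image)
```

The output has one vector per text token, each now mixing information from the other tokens and from the image regions. A practical model would add learned query, key, and value projections, multiple heads, and feed‐forward layers.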

Authors

Muhammad Aoun

University of the Punjab

Tehseen Mazhar

Government of Pakistan

Tariq Shahzad

University of Engineering and Technology Lahore

Journals

Applied Computational Intelligence and Soft Computing

Institutions

University of Johannesburg

COMSATS University Islamabad

University of the Punjab
