Generating accurate and contextually relevant captions for images remains a core challenge in vision‐language understanding. To address this, we propose the double‐attention transformer (DAT), an image captioning model that integrates self‐attention and cross‐attention mechanisms to enhance intramodal feature learning and intermodal semantic alignment. Unlike conventional encoder–decoder models, the DAT enables richer contextual blending between the visual and textual modalities, resulting in more precise and semantically coherent captions. Evaluated under a Flickr8k fine‐tuning regime using standard, publicly available pretrained encoder initializations, the DAT achieves a BLEU‐4 score of 25.6, a METEOR score of 22.3, a CIDEr score of 79.2, and a SPICE score of 15.8. It also consistently outperforms the baseline on precision (0.83), recall (0.78), and F1 score (0.80). Within this setting, the contribution is architectural, and the results indicate competitive performance among methods evaluated under the same fine‐tuning conditions. These results demonstrate the effectiveness of dual attention mechanisms, particularly for fine‐grained semantic representation and caption generation under low‐resource conditions. The DAT offers a lightweight yet robust framework suitable for real‐time image captioning applications where computational resources are constrained.
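The combination of self‐attention and cross‐attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`attention`, `double_attention`), the toy feature vectors, and the choice to apply self‐attention to the text tokens before cross‐attending to image regions are all assumptions made for illustration, and learned projections, multiple heads, residual connections, and normalization are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors.

    Each query is scored against every key, the scores are turned into
    weights with softmax, and the output is the weighted sum of values.
    """
    dim = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


def double_attention(text_tokens, image_regions):
    """Hypothetical double-attention step (illustrative ordering only)."""
    # Self-attention: text tokens attend to each other (intramodal).
    text_context = attention(text_tokens, text_tokens, text_tokens)
    # Cross-attention: text queries attend to image regions (intermodal).
    return attention(text_context, image_regions, image_regions)


# Toy inputs: three text-token vectors and two image-region vectors.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image = [[0.5, 0.5], [2.0, -1.0]]
fused = double_attention(text, image)
```

In this sketch, `fused` holds one vector per text token, and each vector is a weighted average of the image‐region features, which is the sense in which cross‐attention aligns the two modalities.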
Muhammad Aoun
University of the Punjab
Tehseen Mazhar
Government of Pakistan
Tariq Shahzad
University of Engineering and Technology Lahore
Applied Computational Intelligence and Soft Computing
University of Johannesburg
COMSATS University Islamabad
University of the Punjab
Aoun et al. studied this question.
synapsesocial.com/papers/69e866896e0dea528ddeaec4 — DOI: https://doi.org/10.1155/acis/5733967