August 21, 2025 · Open Access

Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation

Key Points

The model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k.
It uses CLIP's ViT visual encoder for image features and Faster R-CNN for region-based object feature extraction.
Multi-step cross-attention iteratively aligns image and text features over multiple rounds, improving semantic consistency between the two modalities.
External commonsense knowledge, retrieved from ConceptNet and embedded with TransE, improves the factual accuracy and lexical richness of the generated descriptions.
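
The multi-step cross-attention idea in the key points can be illustrated with a small sketch. This is not the paper's implementation: it assumes plain scaled dot-product attention without learned projections, a simple averaging residual, and alternating text-to-image and image-to-text updates. The function names `cross_attend` and `multi_step_align` and the `steps` parameter are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query vector attends over keys_values (used as both keys and values).

    Assumption: no learned projection matrices, only scaled dot products.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended

def mix(feats, context):
    """Residual-style update: average each feature with its attended context."""
    return [[0.5 * (f + c) for f, c in zip(fv, cv)] for fv, cv in zip(feats, context)]

def multi_step_align(text_feats, image_feats, steps=3):
    """Alternately refine text and image features for a fixed number of rounds."""
    for _ in range(steps):
        text_feats = mix(text_feats, cross_attend(text_feats, image_feats))
        image_feats = mix(image_feats, cross_attend(image_feats, text_feats))
    return text_feats, image_feats

# Toy 2-dimensional features: 2 text tokens, 3 image regions (made-up values).
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
aligned_text, aligned_image = multi_step_align(text, image, steps=2)
```

Each round pulls the text features toward the image regions they attend to and vice versa, which is the sense in which repeated rounds can increase consistency between the two modalities. A real model would use learned query, key, and value projections and multiple heads.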

Abstract

Image captioning training datasets provide only limited textual descriptions for images that carry multiple meanings, and external commonsense knowledge is underutilized. To address both problems, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, a BERT text encoder, and a GPT-2 text decoder. It incorporates two core mechanisms. The first is a multi-step cross-attention mechanism that iteratively aligns image and text features across multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. The second is commonsense knowledge augmentation: the model employs Faster R-CNN to extract region-based object features, which are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k. Ablation studies confirm that both modules improve caption quality.
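
As a rough illustration of the knowledge embedding step, TransE treats a relation as a translation in embedding space, so a plausible triple (head, relation, tail) satisfies head + relation ≈ tail. The sketch below scores triples by negative L2 distance and composes relations additively for a multi-hop path. The additive path composition is one common choice and an assumption here, not necessarily the paper's reasoning procedure; the embeddings are made-up toy values, and `transe_score`, `path_score`, and `rank_tails` are hypothetical names.

```python
import math

def transe_score(head, relation, tail):
    """Negative L2 distance ||head + relation - tail||; higher means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def path_score(head, relations, tail):
    """Multi-hop score: sum the relation vectors along the path, then score as one hop.

    Assumption: relations compose additively along a path.
    """
    composed = [sum(components) for components in zip(*relations)]
    return transe_score(head, composed, tail)

def rank_tails(head, relation, candidates):
    """Return candidate entity names sorted from most to least plausible tail."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )

# Toy 2-dimensional embeddings (hypothetical values, not trained).
entities = {"dog": [1.0, 0.0], "bone": [1.0, 1.0], "car": [-1.0, 0.0]}
desires = [0.0, 1.0]  # stands in for a ConceptNet-style "Desires" relation

ranking = rank_tails(
    entities["dog"], desires, {"bone": entities["bone"], "car": entities["car"]}
)
```

With these toy vectors, "bone" ranks above "car" as a tail for (dog, Desires, ?), because dog + Desires lands exactly on the bone embedding. In a captioning pipeline, entities linked from detected regions would play the role of `head`, and the top-ranked tails would be candidate commonsense facts to fuse with the visual and textual features.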


Cite This Study

Wang et al. (2025). Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation.

synapsesocial.com/papers/68af5701ad7bf08b1eadd4f2
https://doi.org/10.3390/electronics14163325
