Large volumes of images shared on social media have made image captioning an important tool for social hotspot identification and public opinion monitoring, where accurate visual–language alignment is essential for reliable analysis. However, existing image captioning models based on BLIP-2 (Bootstrapped Language–Image Pre-training) often struggle with complex, context-rich, and socially meaningful images in real-world social media scenarios, mainly due to insufficient cross-modal interaction, redundant visual token representations, and an inadequate ability to capture multi-scale semantic cues. As a result, the generated captions tend to be incomplete or less informative. To address these limitations, this paper proposes ECMA (Enhanced Cross-Modal Attention), a lightweight module integrated into the Querying Transformer (Q-Former) of BLIP-2. ECMA enhances cross-modal interaction through bidirectional attention between visual features and query tokens, enabling more effective information exchange, while a multi-scale visual aggregation strategy is introduced to model semantic representations at different levels of abstraction. In addition, a semantic residual gating mechanism is designed to suppress redundant information while preserving task-relevant features. ECMA can be seamlessly incorporated into BLIP-2 without modifying the original architecture or fine-tuning the vision encoder or the large language model, and is fully compatible with OPT (Open Pre-trained Transformer)-based variants. Experimental results on the COCO (Common Objects in Context) benchmark demonstrate consistent performance improvements, where ECMA improves the CIDEr (Consensus-based Image Description Evaluation) score from 144.6 to 146.8 and the BLEU-4 score from 42.5 to 43.9 on the OPT-6.7B model, corresponding to relative gains of 1.52% and 3.29%, respectively, while also achieving competitive METEOR (Metric for Evaluation of Translation with Explicit Ordering) scores. 
Further evaluations on social media datasets show that ECMA generates more coherent, context-aware, and socially informative captions, particularly for images involving complex interactions and socially meaningful scenes.
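The relative gains reported above can be checked directly from the stated scores. The following is a minimal sketch of that arithmetic only; the helper name `relative_gain` is hypothetical and is not part of the ECMA or BLIP-2 code.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement over the baseline as a percentage."""
    return (improved - baseline) / baseline * 100.0


# Scores reported for the OPT-6.7B variant on COCO (baseline -> with ECMA).
cider_gain = relative_gain(144.6, 146.8)
bleu4_gain = relative_gain(42.5, 43.9)

print(f"CIDEr relative gain:  {cider_gain:.2f}%")
print(f"BLEU-4 relative gain: {bleu4_gain:.2f}%")
```

Rounded to two decimals, these values agree with the 1.52% and 3.29% figures quoted in the abstract.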
Jiang et al. (Mon,) studied this question.