Existing transformer‐based image captioning methods face two primary limitations: the encoder struggles to adequately represent visual features from multiple regions, and the decoder fails to exploit future semantic information during inference. To address these challenges, an attention‐enhanced image captioning model is proposed. In the encoding phase, multigranular visual features are integrated by combining cross‐attention and self‐attention mechanisms, so that both grid and regional features are used. A novel dense global self‐attention module is also introduced to improve performance at minimal computational cost by leveraging the contextual information and fine‐grained details of the image. In the decoding phase, a bidirectional decoding structure with an adaptive masking module dynamically adjusts the focus on past and future semantic information, enabling the model to combine historical and future context when generating more accurate and relevant descriptions. Experimental results on the MSCOCO dataset show that the model outperforms the baseline, with a 2.1 percentage point improvement in the CIDEr metric. The model is particularly well‐suited for biomimetic wearable devices, where real‐time visual assistance plays a crucial role in the user experience. Comprehensive hardware evaluations on the wearable platform demonstrate real‐time efficiency with a minimal memory footprint, significantly outperforming state‐of‐the‐art models in edge deployment scenarios.
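As a rough illustration of the adaptive masking idea in the decoder, the sketch below blends causal and bidirectional attention with a single gate. The abstract does not say how the mask is computed, so everything here is an assumption made for illustration: the function name `adaptive_mask_attention`, the scalar `gate` (which in a trained model would presumably be learned or predicted per step), and the choice to scale future positions multiplicatively before normalization.

```python
import math


def adaptive_mask_attention(scores, gate):
    """Row-wise softmax over a square score matrix (list of lists).

    Hypothetical sketch, not the paper's module: positions j > i (the
    "future" of query i) are down-weighted by `gate` in [0, 1].
    gate = 0 gives ordinary causal attention; gate = 1 gives full
    bidirectional attention.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    weights = []
    for i, row in enumerate(scores):
        # Per-position scale: past and present count fully, future by `gate`.
        scales = [1.0 if j <= i else gate for j in range(len(row))]
        # Subtract the max over reachable positions for numerical stability.
        row_max = max(s for s, c in zip(row, scales) if c > 0.0)
        unnorm = [c * math.exp(s - row_max) if c > 0.0 else 0.0
                  for s, c in zip(row, scales)]
        total = sum(unnorm)
        weights.append([u / total for u in unnorm])
    return weights


if __name__ == "__main__":
    uniform = [[0.0] * 3 for _ in range(3)]
    for g in (0.0, 0.5, 1.0):
        print(g, adaptive_mask_attention(uniform, g)[0])
```

With uniform scores, the first query attends only to itself when `gate` is 0 and spreads its weight evenly over all three positions when `gate` is 1; intermediate values interpolate between the two regimes.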
Yin et al. (Sun,) studied this question.