What question did this study set out to answer?

This research aims to improve medical Visual Question Answering (VQA) by using image captioning to enhance the understanding of medical images.

October 10, 2022

Caption-Aware Medical VQA via Semantic Focusing and Progressive Cross-Modality Comprehension

Key Points

Developed a caption-aware VQA method integrating image content summaries and clinical diagnoses.
Implemented a similarity analysis for guiding attention on significant visual regions based on image captions.
Created a Progressive Compact Bilinear Interactions structure to facilitate cross-modality comprehension among image, question, and caption features.
Achieved superior performance on various medical datasets compared to existing state-of-the-art methods.
Enhanced accuracy of responses through improved utilization of semantic locations and content from captions.
Demonstrated effective multimodal feature integration leading to better question answering outcomes in medical contexts.

Abstract

Medical Visual Question Answering (VQA) is a domain-specific task that requires substantial prior knowledge of medicine. However, deep learning techniques suffer from limited supervision because well-annotated, large-scale medical VQA datasets are scarce. To mitigate this data limitation, image captioning can be introduced to learn summary information about the image, which benefits question answering. To this end, we propose a caption-aware VQA method that reads summaries of image content and clinical diagnoses from a large number of medical images and answers medical questions with richer multimodal features. The proposed method consists of two novel components that emphasize semantic locations and semantic content, respectively. First, to extract and leverage the semantic locations implied in image captioning, a similarity analysis summarizes the attention maps generated during image captioning according to their relevance and guides the visual model to focus on semantically rich regions. Second, to incorporate the semantic content of the generated captions, we propose a Progressive Compact Bilinear Interactions structure that achieves cross-modality comprehension over the image, question, and caption features by performing bilinear attention progressively. Qualitative and quantitative experiments on various medical datasets show that the proposed approach outperforms state-of-the-art methods.
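
The abstract does not spell out how the similarity analysis is computed, so the following is only a rough sketch of one plausible reading, not the authors' implementation. It assumes each generated caption word comes with a flattened attention map over image regions, scores each map by its mean cosine similarity to the other maps, and combines the maps with softmax weights. The function names `cosine` and `summarize_attention`, the relevance score, and the softmax weighting are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def summarize_attention(maps):
    """Combine per-word attention maps into one guidance map.

    maps: list of flattened attention maps, one per caption word
          (hypothetical input format, assumed for this sketch).
    Returns (summary_map, weights), where summary_map sums to 1.
    """
    n = len(maps)

    # Relevance of a map = mean similarity to every other map (assumption).
    relevance = []
    for i in range(n):
        sims = [cosine(maps[i], maps[j]) for j in range(n) if j != i]
        relevance.append(sum(sims) / len(sims) if sims else 1.0)

    # Softmax turns relevance scores into weights that sum to 1.
    peak = max(relevance)
    exps = [math.exp(r - peak) for r in relevance]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the maps, renormalized to a distribution over regions.
    size = len(maps[0])
    summary = [sum(w * m[k] for w, m in zip(weights, maps)) for k in range(size)]
    mass = sum(summary)
    if mass > 0.0:
        summary = [v / mass for v in summary]
    return summary, weights


# Toy example: two maps agree on region 0, one outlier points at region 3.
toy_maps = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
summary_map, map_weights = summarize_attention(toy_maps)
```

In this toy case the outlier map receives the smallest weight, so the summary concentrates on the region the other maps agree on. A visual model could then use such a summary map as a soft mask over region features, although how the paper applies the guidance is not stated in the abstract.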
