This paper proposes a novel Cross-Modal Latent Interaction Network (CLIN) for Visual Question Answering (VQA). CLIN aims to provide computationally efficient yet robust multimodal reasoning for interactive English learning applications by enabling more effective interaction between the text and perceptual modalities. VQA, the task of interpreting and answering natural language questions about images or videos, is an increasingly significant research area at the intersection of Natural Language Processing (NLP) and Computer Vision (CV). While VQA systems have traditionally focused on answering questions about perceptual content, this work extends them with a cross-modal latent interaction approach that enables more accurate representation and understanding of both text and image features. The proposed model uses two types of image representation, BBox-wise and cell-wise, to capture object-level and global contextual information, respectively. These representations are further processed by a refined Cross-Representation Interaction Decoder (CRID) and a specialized Image-Aware Question Re-Instruction Decoder (IAQRD). These modules not only improve VQA performance but also show strong potential for Computer-Assisted Language Learning (CALL): by accurately grounding language in visual contexts, they can in theory support contextual vocabulary acquisition in real-world scenarios. Experiments on several benchmark datasets show that CLIN outperforms representative modular architectures. The comparison focuses on models of similar computational scale, so that the effectiveness of the proposed interaction mechanisms can be evaluated without the influence of the massive parameter scaling found in foundation models.
Wang et al. studied this question.