Visual Question Answering (VQA) is an important application of computer vision and natural language processing that enables machines to understand and answer questions about images. This study outlines the architecture and methodologies employed in a VQA system. First, the image undergoes feature selection and extraction with a Dense Convolutional Neural Network (DenseNet) to capture its visual content. In parallel, a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, encodes the textual question and captures its semantic context. An attention mechanism then fuses the visual and textual information into a cohesive joint representation. ViLBERT (Vision-and-Language BERT), a vision-and-language model, leverages this joint representation to infer and generate accurate responses to the posed questions. The main challenges for VQA systems lie in handling diverse visual content, resolving ambiguities, and reasoning about relationships within images. The system shows promising applications in aiding the visually impaired, enhancing human-computer interaction, and refining image-based retrieval systems, with the aim of enhancing their accuracy, robustness, and interpretability.
Nithish et al. (Thu,) studied this question.
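The attention-based fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention variant, so scaled dot-product soft attention is assumed here, and the function names (`softmax`, `attend`, `fuse`) are hypothetical. The region vectors stand in for DenseNet feature-map outputs and the question vector stands in for the final LSTM hidden state; both are toy values, and they are assumed to share the same dimensionality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_feats, question_vec):
    """Question-guided soft attention over image regions.

    region_feats: list of region vectors (placeholder for DenseNet features).
    question_vec: one vector (placeholder for the LSTM question encoding).
    Assumption: both live in the same d-dimensional space.
    """
    d = len(question_vec)
    # Scaled dot-product relevance of each region to the question.
    scores = [
        sum(r * q for r, q in zip(region, question_vec)) / math.sqrt(d)
        for region in region_feats
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the region vectors.
    attended = [
        sum(w * region[k] for w, region in zip(weights, region_feats))
        for k in range(d)
    ]
    return weights, attended


def fuse(region_feats, question_vec):
    """Joint representation: attended visual vector concatenated with the question."""
    weights, attended = attend(region_feats, question_vec)
    return weights, attended + list(question_vec)


# Toy inputs: three image regions and a question vector aligned with region 1.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
question = [0.0, 4.0, 0.0]
weights, joint = fuse(regions, question)
```

In this toy case the second region receives the largest attention weight because it is the only one aligned with the question vector. In a full system, the joint vector would be passed to the answer-prediction stage, and concatenation is only one of several possible fusion choices.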