Reading comprehension systems are an integral area of artificial intelligence (AI) because they support human learning and information processing. Traditional AI reading tools process only text, relying on spelling, grammar, and a few language-specific rules. Documents that include visual material such as charts, graphs, and diagrams call for a shift towards the multimodal processing that human readers perform, in which visual signals carry important meaning. This visual element can be crucial in domains such as direct healthcare interaction, educational learning, and technical workplace documents. Traditional methods either treat a document's visual components separately from its text or read only the text and ignore the visuals. In both cases the reader is left with an incomplete understanding of the document, and interpretability suffers. To address this, we propose a new multimodal framework called the Visual-Integrated Semantic Textual Assistant (VISTA), which combines the strengths of computer vision and natural language comprehension to link specific passages of text with their most relevant visual indicators. VISTA performs this document-level semantic alignment using semantic anchoring with cross-modal attention, framing and contextualizing the visual signals that convey the meaning expressed in the text. Experimental evaluations on benchmark multimodal datasets showed that VISTA improved reading comprehension accuracy and, more critically, the human interpretability of documents and their visual indicators. By integrating visual data into the reading comprehension process, VISTA aims to foster greater human engagement, better memory, and better decision making. We believe VISTA will improve the usefulness and applicability of multimodal AI reading comprehension systems in the future.
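The abstract does not specify how VISTA implements cross-modal attention or semantic anchoring, so the following is only a minimal sketch of the general idea, assuming standard scaled dot-product attention in which text passage embeddings act as queries and visual element embeddings act as keys and values. The function names, the toy embeddings, and the choice to treat the highest-weight visual element as a passage's "anchor" are all illustrative assumptions, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_vecs, visual_vecs):
    """Attend from each text vector (query) over the visual vectors (keys and values).

    Returns (weights, contexts), where weights[i][j] is how strongly passage i
    attends to visual element j, and contexts[i] is the attention-weighted
    average of the visual vectors for passage i.
    """
    dim = len(visual_vecs[0])
    scale = math.sqrt(dim)
    weights, contexts = [], []
    for q in text_vecs:
        # Scaled dot product between the passage and every visual element.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visual_vecs]
        w = softmax(scores)
        ctx = [
            sum(w[j] * visual_vecs[j][d] for j in range(len(visual_vecs)))
            for d in range(dim)
        ]
        weights.append(w)
        contexts.append(ctx)
    return weights, contexts


# Toy embeddings, made up for illustration: two passages, three visual elements.
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visual_vecs = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

weights, contexts = cross_modal_attention(text_vecs, visual_vecs)

# Hypothetical "anchor": the visual element each passage attends to most.
anchors = [max(range(len(w)), key=w.__getitem__) for w in weights]
```

In this toy setup each passage is anchored to the visual element whose embedding points in the same direction. A real system would obtain the embeddings from trained text and vision encoders and would typically learn projection matrices for the queries, keys, and values.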
Sinha et al. (Thu,) studied this question.