What question did this study set out to answer?

The aim is to enhance human reading comprehension by integrating visual data with text through AI.

March 23, 2026 | Open Access

AI Reading Comprehension Assistants Integrating Visual Data for Enhanced Human Understanding

Key Points

The aim is to enhance human reading comprehension by integrating visual data with text through AI.
Development of the Visual-Integrated Semantic Textual Assistant (VISTA)
Implementation of semantic anchoring and cross-modal attention for document analysis
Evaluation on benchmark multimodal datasets to test reading comprehension capabilities
VISTA showed improved reading comprehension accuracy compared to traditional methods
Enhanced human interpretability observed during document analysis
Increased engagement, better memory retention, and improved decision-making metrics

Abstract

Reading comprehension systems are an integral area of artificial intelligence (AI) because they support human learning and information processing. Traditional AI reading tools process text alone, relying on spelling, grammar, and a few language-specific rules. Documents that include visual material such as charts, graphs, and diagrams require a shift toward the multimodal processing that human readers perform, in which visual signals carry important meaning. This visual element can be crucial in domains such as direct healthcare interaction, education, and workplace technical documentation. Traditional methods separate a document's visual components from its text, so an AI reading tool reads only the text. Handling text and images separately can leave the user with an incomplete understanding and reduces interpretability. To address this, the paper proposes a new multimodal framework, the Visual-Integrated Semantic Textual Assistant (VISTA), which combines the strengths of computer vision and natural language comprehension to link specific passages of text with their most relevant visual indicators. VISTA performs this document-level semantic alignment using semantic anchoring with cross-modal attention, framing and contextualizing visual signals according to the meaning provided by the text. Experimental evaluations on benchmark multimodal datasets showed that VISTA improved reading comprehension accuracy and, more critically, human interpretability when reading documents and their visual indications. By integrating visual data into the reading comprehension process, VISTA aims to foster greater human engagement, better memory retention, and better decision making. We believe VISTA will improve the usefulness and applicability of multimodal AI reading comprehension systems in the future.
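The abstract names cross-modal attention and semantic anchoring but gives no implementation details. As a rough illustration only, the sketch below shows generic scaled dot-product attention from text passages (queries) to visual elements (keys), followed by a hypothetical "anchoring" step that picks the highest-weighted visual for each passage. The function names, the anchoring rule, and the toy embeddings are all assumptions made for this example and are not taken from VISTA.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_vecs, visual_vecs):
    """Attend from each text passage (query) to every visual element (key).

    Returns one row of attention weights per passage; each row sums to 1.
    """
    scale = math.sqrt(len(visual_vecs[0]))
    weights = []
    for query in text_vecs:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in visual_vecs
        ]
        weights.append(softmax(scores))
    return weights


def anchor_passages(text_vecs, visual_vecs):
    """Hypothetical anchoring rule: index of the top-weighted visual per passage."""
    weights = cross_modal_attention(text_vecs, visual_vecs)
    return [max(range(len(row)), key=row.__getitem__) for row in weights]


# Toy embeddings, invented for illustration (not data from the paper).
passages = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
visuals = [[0.0, 0.9, 0.1], [0.8, 0.1, 0.0], [0.0, 0.0, 1.0]]

anchors = anchor_passages(passages, visuals)
print(anchors)  # [1, 0]
```

In this toy case the first passage is anchored to the second visual and the second passage to the first, because those pairs have the largest dot products. A real system would presumably use learned projections and encoder outputs rather than hand-written vectors.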

Cite This Study

Sinha et al. (Thu,) studied this question.

synapsesocial.com/papers/69c0df0bfddb9876e79c1573 https://doi.org/10.1016/j.procs.2026.01.026
