March 3, 2026 | Open Access

Vision Language Models and Document Understanding

Key Points

Model performance fluctuates with layout variations, underscoring the importance of visual structure alongside semantic content.
The evaluation framework systematically varies text alignment, spacing, and font styles while keeping the semantic content fixed.
Comparing model responses across altered PDF representations of the same content reveals significant differences.
Findings highlight that vision-language models depend on both visual and semantic attributes for robust performance.

Abstract

This paper presents an evaluation framework for analyzing document understanding in Vision–Language Models using PDF documents. The framework accepts a document in PDF format, preserves its semantic content, and systematically alters visual layout and formatting attributes such as text alignment, spacing, font styles, and structural organization. Vision–Language Models process the altered documents and generate responses for tasks including content comprehension, information extraction, and question answering. The framework integrates layout variation, content consistency, and response analysis to evaluate robustness and sensitivity across different document representations. Experimental evaluation demonstrates that model performance varies significantly with layout changes despite identical underlying content, indicating a dependence on visual structure in addition to semantic information.
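The evaluation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: `LayoutVariant`, `layout_grid`, `sensitivity`, and `toy_model` are hypothetical names, PDF rendering and the actual Vision–Language Model call are replaced by a stub, and the sensitivity score (the fraction of layout variants whose answer differs from the baseline answer) is an assumed metric chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class LayoutVariant:
    """One visual rendering of the same semantic content (hypothetical schema)."""
    alignment: str
    line_spacing: float
    font: str


def layout_grid(alignments: Sequence[str],
                spacings: Sequence[float],
                fonts: Sequence[str]) -> List[LayoutVariant]:
    """Enumerate every combination of the layout attributes."""
    return [LayoutVariant(a, s, f) for a, s, f in product(alignments, spacings, fonts)]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(answer.lower().split())


def sensitivity(answer_fn: Callable[[LayoutVariant], str],
                baseline: LayoutVariant,
                variants: Sequence[LayoutVariant]) -> float:
    """Fraction of non-baseline variants whose answer differs from the baseline answer."""
    reference = normalize(answer_fn(baseline))
    others = [v for v in variants if v != baseline]
    if not others:
        return 0.0
    changed = sum(normalize(answer_fn(v)) != reference for v in others)
    return changed / len(others)


def toy_model(variant: LayoutVariant) -> str:
    """Stand-in for a real model call: pretends justified text breaks extraction."""
    return "unknown" if variant.alignment == "justify" else "42 units"


grid = layout_grid(["left", "center", "justify"], [1.0, 1.5], ["serif", "sans"])
baseline = LayoutVariant("left", 1.0, "serif")
score = sensitivity(toy_model, baseline, grid)
print(f"{len(grid)} variants, sensitivity = {score:.3f}")
```

In a real study, `toy_model` would be replaced by a function that renders the document with the given layout and queries the model under test. A score of 0 would mean that answers never change with layout, and higher values would indicate stronger dependence on visual structure.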

Cite This Study

Shreyash Killedar studied this question.

synapsesocial.com/papers/69a75bf1c6e9836116a2431e https://doi.org/10.5281/zenodo.18402831