This paper presents an evaluation framework for analyzing document understanding in Vision–Language Models using PDF documents. The framework takes a PDF document as input, preserves its semantic content, and systematically alters visual layout and formatting attributes such as text alignment, spacing, font styles, and structural organization. Vision–Language Models then process the altered documents and generate responses for tasks including content comprehension, information extraction, and question answering. By combining layout variation, content consistency, and response analysis, the framework evaluates robustness and sensitivity across different representations of the same document. Experimental evaluation shows that model performance varies significantly with layout changes despite identical underlying content, indicating that the models depend on visual structure in addition to semantic information.
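The response-analysis step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the exact-match scoring, and the two summary scores (accuracy spread across layouts and cross-layout answer agreement) are all assumptions chosen for illustration. It assumes model answers have already been collected for each layout variant of the same document.

```python
from collections import Counter
from statistics import mean


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy_by_layout(responses, gold):
    """Exact-match accuracy per layout variant (hypothetical scoring rule).

    responses: {layout_name: {question_id: answer}}
    gold:      {question_id: reference_answer}
    """
    return {
        layout: mean(
            1.0 if normalize(answers.get(qid, "")) == normalize(ref) else 0.0
            for qid, ref in gold.items()
        )
        for layout, answers in responses.items()
    }


def sensitivity(acc):
    """Spread between the best and worst layout; 0.0 means layout had no effect on accuracy."""
    return max(acc.values()) - min(acc.values())


def consistency(responses, gold):
    """Average share of layouts that give the most common answer to each question."""
    scores = []
    for qid in gold:
        answers = [normalize(r.get(qid, "")) for r in responses.values()]
        top_count = Counter(answers).most_common(1)[0][1]
        scores.append(top_count / len(answers))
    return mean(scores)


# Made-up toy data: the same two questions asked over three layout variants.
gold = {"q1": "42", "q2": "Paris"}
responses = {
    "original": {"q1": "42", "q2": "Paris"},
    "centered": {"q1": "42", "q2": "paris "},
    "two_column": {"q1": "24", "q2": "Paris"},
}

acc = accuracy_by_layout(responses, gold)
print(acc)
print("sensitivity:", sensitivity(acc))
print("consistency:", consistency(responses, gold))
```

Separating accuracy spread from answer agreement is a deliberate choice in this sketch: a model can be consistently wrong across layouts (high agreement, low accuracy), which is a different failure from being correct on one layout and wrong on another.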
Shreyash Killedar (Wed,) studied this question.