We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking exponential decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
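The augmentation described above, a single relevant image placed among a growing number of distractors, can be illustrated with a minimal sketch. This is not the LoCoVQA implementation: the function name `build_visual_context`, its parameters, and the file names are hypothetical, and the sketch only arranges image handles, without loading or compositing any images.

```python
import random


def build_visual_context(target, distractor_pool, context_length, seed=0):
    """Return (context, target_index): the target hidden among distractors.

    Hypothetical helper for illustration only. Items may be image paths or
    any other handles; no image data is read here.
    """
    if context_length < 1:
        raise ValueError("context_length must be at least 1")
    num_distractors = context_length - 1
    if num_distractors > len(distractor_pool):
        raise ValueError("not enough distractors for the requested length")

    rng = random.Random(seed)  # seeded so each generated example is reproducible
    context = rng.sample(distractor_pool, num_distractors)  # no repeated distractors
    target_index = rng.randrange(context_length)  # uniform position for the target
    context.insert(target_index, target)
    return context, target_index


if __name__ == "__main__":
    pool = [f"distractor_{i}.png" for i in range(50)]
    for n in (1, 4, 9, 16):
        context, idx = build_visual_context("target.png", pool, n, seed=n)
        print(f"length={n} target_at={idx} items={len(context)}")
```

Evaluating the same query at each context length and plotting accuracy against length would then show how quickly performance falls as distractors are added. The pool could hold in-distribution or out-of-distribution images, depending on which kind of distractor is being studied.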
Sharma et al. studied this question.
www.synapsesocial.com/papers/68e639e5b6db6435875cb5c6 — DOI: https://doi.org/10.48550/arxiv.2406.16851
Aditya Sharma
Michael Saxon
William Yang Wang