What question did this study set out to answer?

The study asks whether a classification-guided large vision-language model can extract information accurately from many types of visually rich documents with minimal supervision.

May 5, 2026

Visual information extraction from documents via classification-guided large vision-language models.

Key Points

Proposed a framework separating document-type classification from content extraction.
Utilized in-context learning for dynamic prompt engineering.
Evaluated the model on a real-world bidding dataset with 16 certificate types.
Achieved a zero-shot F1-score of 86.43%, outperforming a supervised baseline (68.08%) by 18.35 percentage points.
Raised normalized edit distance (NED) to 0.90, compared with 0.67 for the baseline.
Optional fine-tuning led to a further performance increase to 93.65% F1 and 0.93 NED.
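
The summary does not say how NED is computed. Since higher values are reported as better, one plausible reading is the similarity form, 1 minus the Levenshtein distance divided by the longer string's length. The sketch below illustrates that assumed definition only; the function names are hypothetical and the paper's exact formula may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ned_similarity(pred: str, truth: str) -> float:
    """Assumed definition: 1 - distance / max length, so 1.0 means an exact match."""
    if not pred and not truth:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / max(len(pred), len(truth))
```

Under this reading, a score of 0.90 would mean that roughly one character in ten of the longer string needs editing to match the ground truth.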

Abstract

Visual information extraction (VIE) from visually rich documents remains challenging due to high layout variability and real-world impairments. Existing methods typically rely on sequential OCR pipelines or end-to-end models requiring extensive labeled data and layout-specific training, limiting their scalability. We propose a classification-guided large vision-language model (LVLM) framework for multi-type VIE that achieves high accuracy with minimal supervision. The approach decouples document-type classification from content extraction and employs in-context learning (ICL)-based dynamic prompt engineering to inject task-specific knowledge, enabling robust zero-shot inference across diverse layouts. From a theoretical perspective, the proposed method can be viewed as a form of conditional computation that reduces task uncertainty and improves information efficiency during prompt-based inference. Evaluated on a real-world bidding dataset with 16 certificate types, our zero-shot method (based on Qwen2.5-VL-7B) outperforms a strong supervised baseline by 18.35 percentage points in F1-score (86.43% vs. 68.08%) and 0.23 in normalized edit distance (0.90 vs. 0.67). Optional domain-specific fine-tuning further improves performance to 93.65% F1 and 0.93 NED, demonstrating superior robustness against seals, watermarks, and low contrast. The framework offers an efficient, scalable solution for complex document understanding in office automation. Code is available at https://github.com/FairmeHIT/Multi-VIE, and fine-tuned models at https://huggingface.co/fairme/Qwen2.5-VL-7B-SFT.
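
The decoupling described in the abstract can be pictured as two model calls: one that names the document type, and one whose prompt is assembled from knowledge specific to that type. The following is a minimal sketch under stated assumptions, not the authors' implementation: `VisionLM`, `PROMPT_LIBRARY`, the two document types, and their field names are all hypothetical placeholders, and the model is any callable that takes a prompt and image bytes and returns text.

```python
import json
from typing import Callable, Dict, List

# Hypothetical interface: (prompt text, image bytes) -> model's text reply.
VisionLM = Callable[[str, bytes], str]

# Hypothetical per-type knowledge: target fields plus one in-context example.
PROMPT_LIBRARY: Dict[str, Dict] = {
    "business_license": {
        "fields": ["company_name", "license_number"],
        "example": {"company_name": "Example Co.", "license_number": "0000"},
    },
    "quality_certificate": {
        "fields": ["holder", "valid_until"],
        "example": {"holder": "Example Co.", "valid_until": "2030-01-01"},
    },
}


def classify(model: VisionLM, image: bytes, doc_types: List[str]) -> str:
    """Stage 1: ask only for the document type."""
    prompt = ("Which document type is shown? Answer with exactly one of: "
              + ", ".join(doc_types))
    answer = model(prompt, image).strip()
    return answer if answer in doc_types else "unknown"


def build_extraction_prompt(doc_type: str) -> str:
    """Assemble a type-specific prompt with its in-context example."""
    spec = PROMPT_LIBRARY[doc_type]
    return (f"This image is a {doc_type}. Extract the fields "
            f"{', '.join(spec['fields'])} and reply with JSON only.\n"
            f"Example output: {json.dumps(spec['example'])}")


def extract(model: VisionLM, image: bytes) -> Dict:
    """Stage 2: extraction conditioned on the predicted type."""
    doc_type = classify(model, image, list(PROMPT_LIBRARY))
    if doc_type == "unknown":
        return {"doc_type": "unknown", "fields": {}}
    reply = model(build_extraction_prompt(doc_type), image)
    # A real system would need to handle replies that are not valid JSON.
    return {"doc_type": doc_type, "fields": json.loads(reply)}
```

Conditioning the second prompt on the predicted type narrows what the model has to infer in a single step, which is one way to read the abstract's description of the method as conditional computation that reduces task uncertainty.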
