
September 10, 2025 · Open Access

V-PRUNE: Semantic-Aware Patch Pruning Before Tokenization in Vision–Language Model Inference

Key Points

V-PRUNE reduces inference time in vision–language models while maintaining or improving accuracy.
Applied to CLIP-based models, the method lowers FLOPs across vision–language understanding tasks without architectural changes.
V-PRUNE scores local similarity between image patches using color and histogram statistics and prunes redundant patches before tokenization, which keeps the pruning decisions lightweight and interpretable.
Qualitative results confirm that essential image regions are preserved and that the pruning behavior aligns with human perception.

Abstract

Recent vision–language models (VLMs) achieve strong performance across multimodal benchmarks but suffer from high inference costs due to the large number of visual tokens. Prior studies have shown that many image tokens receive consistently low attention scores during inference, indicating that a substantial portion of visual content contributes little to final predictions. These observations raise questions about the efficiency of conventional token pruning strategies, which are typically applied after all attention operations and depend on late-emerging attention scores. To address this, we propose V-PRUNE, a semantic-aware patch-level pruning framework for vision–language models that removes redundant content before tokenization. By evaluating local similarity via color and histogram statistics, our method enables lightweight and interpretable pruning without architectural changes. Applied to CLIP-based models, our approach reduces FLOPs and inference time across vision–language understanding tasks, while maintaining or improving accuracy. Qualitative results further confirm that essential regions are preserved and the pruning behavior is human-aligned, making our method a practical solution for efficient VLM inference.
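
The abstract does not spell out the pruning rule, so the following is only a minimal sketch of what pruning patches before tokenization by color and histogram similarity could look like. The descriptor (mean color plus an 8-bin per-channel histogram), the two thresholds, the left/top neighbor comparison, and all function names are assumptions made for illustration, not the published V-PRUNE algorithm. Plain Python lists are used to keep the example dependency-free.

```python
# Illustrative sketch only: the descriptor, thresholds, and neighbor rule
# are assumptions, not the published V-PRUNE algorithm.

def patch_pixels(image, row, col, size):
    """Return the RGB pixels of the patch at grid position (row, col)."""
    return [
        image[y][x]
        for y in range(row * size, (row + 1) * size)
        for x in range(col * size, (col + 1) * size)
    ]


def describe(pixels, bins=8):
    """Mean color in [0, 1] plus a normalized histogram per RGB channel."""
    n = len(pixels)
    mean = [sum(p[ch] for p in pixels) / (255.0 * n) for ch in range(3)]
    hist = [[0.0] * bins for _ in range(3)]
    for p in pixels:
        for ch in range(3):
            hist[ch][p[ch] * bins // 256] += 1.0 / n
    return mean, hist


def is_redundant(a, b, color_tol, hist_tol):
    """True when two descriptors are close in both color and histogram."""
    (mean_a, hist_a), (mean_b, hist_b) = a, b
    color_dist = sum(abs(x - y) for x, y in zip(mean_a, mean_b)) / 3
    # Total-variation distance per channel, averaged over the channels.
    hist_dist = sum(
        0.5 * sum(abs(x - y) for x, y in zip(ha, hb))
        for ha, hb in zip(hist_a, hist_b)
    ) / 3
    return color_dist < color_tol and hist_dist < hist_tol


def keep_mask(image, size=16, color_tol=0.02, hist_tol=0.1):
    """Mark a patch as pruned if it matches its left or top neighbor."""
    rows, cols = len(image) // size, len(image[0]) // size
    desc = [
        [describe(patch_pixels(image, r, c, size)) for c in range(cols)]
        for r in range(rows)
    ]
    keep = [[True] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            if c > 0:
                neighbors.append(desc[r][c - 1])
            if r > 0:
                neighbors.append(desc[r - 1][c])
            if any(is_redundant(desc[r][c], nb, color_tol, hist_tol)
                   for nb in neighbors):
                keep[r][c] = False
    return keep


def kept_patches(image, size=16, **tols):
    """Grid positions of the patches that would be passed to the tokenizer."""
    mask = keep_mask(image, size, **tols)
    return [(r, c) for r, row in enumerate(mask)
            for c, kept in enumerate(row) if kept]
```

In a setup like this, only the patches at the positions returned by `kept_patches` would be embedded, together with their original grid positions, so the vision encoder processes fewer tokens from the start. The comparison uses only pixel statistics, so the reason a patch was dropped can be read directly from its color and histogram distances.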
