Recent vision–language models (VLMs) achieve strong performance across multimodal benchmarks but suffer from high inference costs due to the large number of visual tokens. Prior studies have shown that many image tokens receive consistently low attention scores during inference, indicating that a substantial portion of visual content contributes little to final predictions. These observations raise questions about the efficiency of conventional token pruning strategies, which are typically applied after all attention operations and depend on late-emerging attention scores. To address this, we propose V-PRUNE, a semantic-aware patch-level pruning framework for vision–language models that removes redundant content before tokenization. By evaluating local similarity via color and histogram statistics, our method enables lightweight and interpretable pruning without architectural changes. Applied to CLIP-based models, our approach reduces FLOPs and inference time across vision–language understanding tasks, while maintaining or improving accuracy. Qualitative results further confirm that essential regions are preserved and the pruning behavior is human-aligned, making our method a practical solution for efficient VLM inference.
Seo et al. studied this question.
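To make the idea of pruning patches by local color and histogram similarity concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the statistics, thresholds, or neighborhood used by V-PRUNE, so everything here is an illustrative assumption: the function names `patch_features` and `keep_mask`, the 16-pixel patch size, the 8-bin intensity histogram, the left/top neighbor comparison, and the thresholds `color_tol` and `hist_tol`. A patch is dropped when it closely matches an adjacent patch that has already been kept, so every pruned patch is still represented by a retained neighbor.

```python
import numpy as np


def patch_features(image, patch=16, bins=8):
    """Return per-patch mean color and intensity histogram for an HxWx3 uint8 image."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    # Crop to a multiple of the patch size, then group pixels by patch:
    # result has shape (gh, gw, patch, patch, 3).
    x = image[:gh * patch, :gw * patch].astype(np.float32)
    x = x.reshape(gh, patch, gw, patch, 3).transpose(0, 2, 1, 3, 4)

    mean_color = x.mean(axis=(2, 3)) / 255.0      # (gh, gw, 3), values in [0, 1]
    gray = x.mean(axis=-1).reshape(gh, gw, -1)    # (gh, gw, patch * patch)

    # Normalized intensity histogram per patch (each histogram sums to 1).
    idx = np.minimum((gray / 256.0 * bins).astype(np.int64), bins - 1)
    hist = np.zeros((gh, gw, bins), dtype=np.float32)
    for b in range(bins):
        hist[..., b] = (idx == b).mean(axis=-1)
    return mean_color, hist


def keep_mask(image, patch=16, bins=8, color_tol=0.05, hist_tol=0.9):
    """Boolean (gh, gw) mask; False marks patches judged redundant (assumed rule)."""
    mean_color, hist = patch_features(image, patch, bins)
    gh, gw = hist.shape[:2]
    keep = np.ones((gh, gw), dtype=bool)
    for i in range(gh):
        for j in range(gw):
            # Compare only against left and top neighbors that were kept.
            for di, dj in ((0, -1), (-1, 0)):
                ni, nj = i + di, j + dj
                if ni < 0 or nj < 0 or not keep[ni, nj]:
                    continue
                color_dist = np.abs(mean_color[i, j] - mean_color[ni, nj]).max()
                hist_sim = np.minimum(hist[i, j], hist[ni, nj]).sum()  # histogram intersection
                if color_dist < color_tol and hist_sim > hist_tol:
                    keep[i, j] = False
                    break
    return keep


if __name__ == "__main__":
    flat = np.full((64, 64, 3), 128, dtype=np.uint8)  # uniform gray test image
    mask = keep_mask(flat)
    print(mask.sum(), "of", mask.size, "patches kept")
```

In a pipeline of the kind the abstract describes, only the patches where the mask is `True` would be passed on to the vision encoder's patch embedding, which is where the reduction in visual tokens, and hence in FLOPs, would come from. How the retained patches are positioned or padded for a CLIP-based encoder is not stated in the abstract and is left out of this sketch.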