What type of study is this?

This is a quantitative study.

October 2, 2025 | Open Access

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Key Points

ScaleCap improves image captioning by countering the multimodal and linguistic biases of LVLMs, yielding more balanced and detailed captions.
Pretraining an LVLM on 450K images annotated with ScaleCap yields consistent performance gains across 11 widely used benchmarks.
The methodology involves heuristic question answering and contrastive sentence rating to enhance caption quality.
Findings highlight how additional inference resources directly improve the richness and fidelity of generated captions.

Abstract

This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in two inherent biases of large vision-language models (LVLMs): multimodal bias, which produces imbalanced descriptive granularity (detailed accounts of some elements and only cursory mention of others), and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy that continuously enriches and calibrates the caption as the inference budget increases. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to identify and eliminate hallucinations caused by linguistic bias. With increased inference cost, ScaleCap raises more heuristic questions to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap demonstrates the richness and fidelity of its generated captions on two additional tasks: replacing images with captions in a VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
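
The two components described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the linked repository for that), and every name in it is hypothetical: `scorer`, `ask`, `answer`, `visual_gain`, `rate_sentences`, and `scale_caption` are placeholders introduced here. The sketch assumes that a sentence is rated by comparing its mean token log-probability with and without the image, and that a sentence is kept only when the image raises that score by more than a threshold. The exact rating rule and threshold used by ScaleCap may differ.

```python
from statistics import fmean
from typing import Callable, List, Optional, Sequence

# Hypothetical scorer: token log-probabilities of `sentence` given `context`,
# computed with the image (True) or with text only (False).
Scorer = Callable[[str, str, bool], Sequence[float]]


def visual_gain(context: str, sentence: str, scorer: Scorer) -> float:
    """How much the image raises the sentence's mean token log-probability."""
    with_image = fmean(scorer(context, sentence, True))
    without_image = fmean(scorer(context, sentence, False))
    return with_image - without_image


def rate_sentences(sentences: List[str], scorer: Scorer, threshold: float) -> List[str]:
    """Keep sentences the image supports; drop ones a text-only prior explains."""
    kept: List[str] = []
    for sentence in sentences:
        if visual_gain(" ".join(kept), sentence, scorer) > threshold:
            kept.append(sentence)
    return kept


def scale_caption(
    draft: List[str],
    ask: Callable[[List[str]], Optional[str]],  # hypothetical question generator
    answer: Callable[[str], str],               # hypothetical image-grounded answerer
    scorer: Scorer,
    budget: int,
    threshold: float = 0.5,                     # assumed value, for illustration only
) -> List[str]:
    """Filter the draft, then spend `budget` question-answer rounds enriching it."""
    caption = rate_sentences(draft, scorer, threshold)
    for _ in range(budget):
        question = ask(caption)
        if question is None:  # nothing left to ask
            break
        detail = answer(question)
        if visual_gain(" ".join(caption), detail, scorer) > threshold:
            caption.append(detail)
    return caption


# Toy stand-ins so the sketch runs without a model.
def toy_scorer(context: str, sentence: str, use_image: bool) -> Sequence[float]:
    grounded = "dog" in sentence or "grass" in sentence  # pretend the image shows these
    return [-0.2, -0.3] if (use_image and grounded) else [-1.0, -1.2]


_questions = iter(["What color is the dog?", "What is the dog holding?"])
_answers = {
    "What color is the dog?": "The dog is brown.",
    "What is the dog holding?": "It holds a frisbee.",
}


def toy_ask(caption: List[str]) -> Optional[str]:
    return next(_questions, None)


def toy_answer(question: str) -> str:
    return _answers[question]


result = scale_caption(
    ["A dog runs on grass.", "A cat sleeps nearby."],
    toy_ask, toy_answer, toy_scorer, budget=2,
)
print(result)
```

In this toy run, the sentences about the cat and the frisbee gain nothing from the image and are dropped, while the grounded sentences are kept. Raising `budget` lets the loop ask more questions, which mirrors the claim that a larger inference budget yields richer captions.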
