August 17, 2025

From higher to lower: A guidance-propagation hierarchical attention for video captioning

Key Points

The proposed G2L framework first selects salient clips and frames, then refines region-level context within them to generate captions.
On the MSVD and MSR-VTT benchmarks, G2L improves over existing state-of-the-art approaches.
Differentiable Gumbel Top-K sampling is used to select the salient clips and frames.
Ablations indicate that the global-to-local cascade and dual-branch optimization jointly account for the performance gain.

Abstract

In recent years, video captioning has become a prominent research focus. Its main challenge is how to capture the essential semantic elements – such as objects, actions, and their spatial-temporal relationships – from abundant and redundant visual content. To address this challenge, earlier methods generally concentrate on either extracting representative clips across multiple frames (global level) or locating salient areas within single frames (local level). Many existing methods ignore the fundamental hierarchical organization of videos, in which identifying representative frames should come before locating informative regions. To address this limitation, we propose G2L, a hierarchical attention framework that (1) selects salient clips and frames via differentiable Gumbel Top-K sampling and (2) refines region-level context for caption generation. Extensive experiments on the widely adopted benchmarks MSVD and MSR-VTT show that our method achieves notable improvements over existing state-of-the-art approaches. Ablation studies show that the global-to-local cascade and dual-branch optimization jointly account for the gain.
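
To make the selection step more concrete, the following is a minimal sketch of Gumbel Top-K sampling in plain Python. It is not the authors' implementation: the function names (`gumbel_top_k`, `relaxed_top_k`), the temperature `tau`, and the example scores are assumptions made for illustration, and the successive-softmax relaxation shown here is one commonly used way to make top-k selection differentiable, which may differ from what G2L actually uses.

```python
import math
import random


def gumbel_noise(rng):
    # Sample from Gumbel(0, 1) by inverse transform: -log(-log(U)).
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_top_k(scores, k, rng):
    """Hard selection: indices of the k largest Gumbel-perturbed scores."""
    perturbed = [s + gumbel_noise(rng) for s in scores]
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return sorted(order[:k]), perturbed


def relaxed_top_k(perturbed, k, tau=0.5):
    """Soft k-hot weights built from k successive softmaxes (assumed relaxation)."""
    logits = list(perturbed)
    weights = [0.0] * len(logits)
    for _ in range(k):
        probs = softmax([x / tau for x in logits])
        weights = [w + p for w, p in zip(weights, probs)]
        # Down-weight items that were already (softly) chosen this round.
        logits = [x + math.log(max(1.0 - p, 1e-12)) for x, p in zip(logits, probs)]
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    frame_scores = [2.0, 0.1, 1.5, -1.0, 0.7]  # hypothetical saliency scores
    picked, perturbed = gumbel_top_k(frame_scores, k=2, rng=rng)
    print("selected frame indices:", picked)
    print("soft selection weights:", relaxed_top_k(perturbed, k=2))
```

The hard variant returns discrete indices, which block gradients. The relaxed variant returns weights that sum to `k` and approach a k-hot vector as `tau` shrinks, so in an autodiff framework the same computation could pass gradients back to the scores.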
