June 1, 2023

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Abstract

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, pursue semantic interaction over pre-defined video-text pairs and have achieved outstanding performance. To go beyond this coarse-grained global interaction, we have to tackle challenging shell-breaking interactions for fine-grained cross-modal learning. In this paper, we model video and text as game players using multivariate cooperative game theory to handle the uncertainty in fine-grained semantic interaction, which has diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondences between video frames and text words for sensitive and explainable cross-modal contrast. To efficiently realize the cooperative game of multiple video frames and multiple text words, the proposed method clusters the original video frames (and, likewise, the text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video-question answering benchmarks show superior performance and justify the efficacy of HBI. More encouragingly, HBI can also serve as a visualization tool to promote the understanding of cross-modal interaction, which may have a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
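
As a rough illustration of the quantity the abstract refers to, the sketch below computes the standard pairwise Banzhaf interaction index from cooperative game theory by brute force. For two players, it averages the extra value they produce together, beyond their separate contributions, over all coalitions of the remaining players. The toy game is purely an assumption for illustration: the player names, the `SIM` similarity scores, and the `value` function are hypothetical and are not the paper's learned value function, clustering, or token merge modules.

```python
from itertools import combinations

# Hypothetical frame-word similarity scores (illustrative values only).
SIM = {
    ("f1", "w1"): 0.9,
    ("f1", "w2"): 0.1,
    ("f2", "w1"): 0.2,
    ("f2", "w2"): 0.8,
}
FRAMES = {"f1", "f2"}
WORDS = {"w1", "w2"}
PLAYERS = sorted(FRAMES | WORDS)


def value(coalition):
    """Toy game value: total similarity of all frame-word pairs in the coalition."""
    return sum(
        SIM[(f, w)]
        for f in coalition & FRAMES
        for w in coalition & WORDS
    )


def banzhaf_interaction(players, i, j, value_fn):
    """Pairwise Banzhaf interaction index of players i and j.

    Averages v(S + {i, j}) - v(S + {i}) - v(S + {j}) + v(S)
    over every subset S of the remaining players.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (
                value_fn(s | {i, j})
                - value_fn(s | {i})
                - value_fn(s | {j})
                + value_fn(s)
            )
    return total / 2 ** len(others)


if __name__ == "__main__":
    # A frame and a word that match well interact strongly.
    print(banzhaf_interaction(PLAYERS, "f1", "w1", value))
    # Two frames create no joint value in this toy game.
    print(banzhaf_interaction(PLAYERS, "f1", "f2", value))
```

The brute-force loop visits every subset of the other players, so its cost grows exponentially with the number of players. That cost is one way to read the abstract's motivation for clustering frames and words into fewer merged tokens before computing the interaction.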

Authors

Peng Jin

Jinfa Huang

Pengfei Xiong

Institutions

Tsinghua University

Peking University

Peng Cheng Laboratory

Cite this study

Jin et al. (2023) studied this question.

www.synapsesocial.com/papers/6a0e95818967b8cf44045135 — DOI: https://doi.org/10.1109/cvpr52729.2023.00244

Also consider

Synapse has enriched 3 closely related papers on similar questions. Consider them for comparative context:

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval · 2022 · 125 citations
Fine-Grained Semantically Aligned Vision-Language Pre-Training · 2022 · 29 citations
COMPUTATION OF POWER INDICES · 2002 · 39 citations
