What question did this study set out to answer?

The research aims to enhance video moment retrieval by improving long-range temporal context capture.

April 5, 2026 · Open Access

Mamba-based modulated fusion model for video moment retrieval

Key Points

The research aims to enhance video moment retrieval by improving long-range temporal context capture.
Implemented a two-level fusion architecture called Hybrid Mamba Network (HM-Net)
Utilized the Hybrid Modulated Bi-Mamba (HMB) Block for enhanced temporal modeling
Conducted experiments on the TACoS and QVHighlights benchmarks to evaluate performance
Achieved a 3.84% improvement in R1@0.5 on the TACoS benchmark
Obtained a 1.65% increase in mAP on QVHighlights
Demonstrated enhanced localization accuracy in long-form videos

Abstract

Video Moment Retrieval (VMR) is a fundamental task in video understanding that bridges vision and language by localizing the temporal segments of an untrimmed video most relevant to a textual query. Existing approaches excel at fine-grained alignment but often fail to capture global temporal context effectively, particularly in long-form videos. To address this challenge, we propose Hybrid Mamba Network (HM-Net), a two-level fusion architecture that unifies the strengths of attention and sequence modeling. At its core is the Hybrid Modulated Bi-Mamba (HMB) Block, which integrates the temporal modeling capability of Mamba into the VMR framework to achieve effective long-range temporal reasoning. Extensive experiments on the challenging TACoS and QVHighlights benchmarks show that HM-Net consistently outperforms existing approaches, improving R1@0.5 by 3.84% on TACoS and mAP by 1.65% on QVHighlights, with notable gains in localization accuracy on long-form videos.
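For readers unfamiliar with the R1@0.5 metric cited above, the sketch below shows how it is commonly computed in moment retrieval: a query counts as a hit when its top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.5. This is a minimal illustration and not the paper's evaluation code; the names `temporal_iou` and `recall_at_1` and the `(start, end)` span representation in seconds are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a moment is a (start_sec, end_sec) pair.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: Sequence[Span],
    gts: Sequence[Span],
    iou_threshold: float = 0.5,
) -> float:
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return hits / len(gts)


if __name__ == "__main__":
    predictions = [(2.0, 8.0), (0.0, 10.0)]
    ground_truth = [(0.0, 10.0), (20.0, 30.0)]
    # The first query overlaps with IoU 0.6 (a hit); the second does not overlap.
    print(recall_at_1(predictions, ground_truth))
```

The function returns a fraction between 0 and 1; benchmark tables typically report this value multiplied by 100.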

Cite This Study

Yu et al. studied this question.

synapsesocial.com/papers/69d1fc4fa79560c99a0a1db8 https://doi.org/10.1038/s41598-026-44804-x
