October 26, 2024 · Open Access

Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

Abstract

In this paper, we explore the use of large language models (LLMs) to enhance video moment retrieval (VMR) by integrating general knowledge and pseudo-events as priors. LLMs are limited in generating continuous outputs, such as salience scores and inter-frame embeddings, which are critical for capturing inter-frame relations. To address these limitations, we propose using LLM encoders, which effectively refine inter-concept relations in multimodal embeddings even without textual training. Our feasibility study shows that this capability extends to other embeddings, such as BLIP and T5, when they exhibit patterns similar to those of CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. By refining concept relations, the LLM encoder can help the model achieve a balanced understanding of foreground concepts (e.g., persons, faces) and background concepts (e.g., streets, mountains), rather than focusing only on the visually dominant foreground concepts. Additionally, we use pseudo-events, identified via event detection, to guide accurate moment prediction within event boundaries, reducing distraction from adjacent moments. Experiments show that our plug-in approach for semantic refinement and pseudo-event regulation achieves state-of-the-art VMR performance. The source code is available at https://github.com/fletcherjiang/LLMEPET.
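The abstract does not specify how pseudo-events are detected or how the regulation is applied, so the following is only a rough illustration of the idea of keeping a predicted moment inside event boundaries. It assumes a simple, hypothetical rule: a pseudo-event boundary is placed wherever the cosine similarity between adjacent frame embeddings drops below a threshold, and a predicted moment is then clipped to the pseudo-event that contains its most salient frame. The function names `detect_pseudo_events` and `regulate_moment`, the 0.8 threshold, and the post-hoc clipping are all assumptions for illustration, not the LLMEPET implementation.

```python
import math
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # inclusive (start_frame, end_frame)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_pseudo_events(frames: Sequence[Sequence[float]],
                         threshold: float = 0.8) -> List[Span]:
    """Hypothetical detector: start a new event wherever adjacent
    frame embeddings are less similar than `threshold`."""
    if not frames:
        return []
    events: List[Span] = []
    start = 0
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            events.append((start, i - 1))  # close the current event
            start = i
    events.append((start, len(frames) - 1))  # close the final event
    return events


def regulate_moment(pred: Span, saliency: Sequence[float],
                    events: Sequence[Span]) -> Span:
    """Hypothetical post-hoc regulation: clip the predicted span to the
    pseudo-event that contains its most salient frame."""
    s, e = pred
    peak = max(range(s, e + 1), key=lambda i: saliency[i])
    for ev_start, ev_end in events:
        if ev_start <= peak <= ev_end:
            return (max(s, ev_start), min(e, ev_end))
    return pred  # peak not covered by any event: leave prediction unchanged


# Toy example: 5 frames with 2-D embeddings forming three visual "scenes".
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
saliency = [0.2, 0.3, 0.9, 0.8, 0.4]
events = detect_pseudo_events(frames)
print(events)                                     # [(0, 1), (2, 3), (4, 4)]
print(regulate_moment((1, 4), saliency, events))  # (2, 3)
```

In this toy case the raw prediction (1, 4) spills into frames 1 and 4, which belong to neighbouring scenes; clipping it to the pseudo-event around the saliency peak (frame 2) keeps only frames 2 to 3. A trained model would more likely use such boundaries as a training signal than as a hard post-processing step; the clipping here is just the simplest way to make the boundary constraint visible.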

Cite This Study

Jiang et al. (2024). Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval. https://doi.org/10.1145/3664647.3681115