Abstract Identifying relevant moments in video content from natural language queries is a challenging task, especially as the volume of video content grows exponentially. Solutions need to detect relevant moments by capturing the global context of a video and going beyond local temporal correlations. In this paper, we propose a comprehensive framework for video content retrieval and moment detection based on natural language queries. The framework centers on a novel temporal context modeling approach that captures and exploits long-range dependencies and contextual information across different time scales. We also integrate a specialized encoder-decoder model with multi-modal intent to extract intricate temporal patterns and context from video data and to understand the user's intent. Temporal variance is a prominent obstacle in moment retrieval; our temporal context modeling is designed to handle this variance, resulting in more accurate content retrieval and analysis. We conducted a comprehensive comparative study on available datasets using two kinds of metrics: recall (R@1, R@5), which emphasizes ranking accuracy for the top highlights, and mean average precision (mAP), which assesses overall ranking quality by prioritizing relevance to the query across multiple recall levels. Our experimental findings show that the proposed framework outperforms existing methods. Incorporating the multi-modal intent technique together with the temporal context modeling enhances the ability to identify relevant moments and highlights. The proposed framework is a novel approach to video content retrieval and moment detection that has the potential to revolutionize the way we interact with video content.
Singh et al. (Fri,) studied this question.
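The evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state how a predicted moment is matched to the ground truth, so this example assumes the common convention of a temporal intersection-over-union (IoU) threshold, set here to 0.5. All function names, the threshold, and the sample spans are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(ranked: List[Span], gt: Span, k: int, iou_thr: float = 0.5) -> float:
    """1.0 if any of the top-k predictions matches the ground truth, else 0.0."""
    return float(any(temporal_iou(p, gt) >= iou_thr for p in ranked[:k]))


def average_precision(hits: List[bool], num_relevant: int) -> float:
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    if num_relevant == 0:
        return 0.0
    found, total = 0, 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            total += found / rank
    return total / num_relevant


def mean_average_precision(per_query_ap: List[float]) -> float:
    """mAP: the average of per-query AP values."""
    return sum(per_query_ap) / len(per_query_ap) if per_query_ap else 0.0


# Made-up example: three ranked predictions for one query, one ground-truth moment.
ranked = [(10.0, 20.0), (30.0, 45.0), (0.0, 5.0)]
gt = (32.0, 44.0)
hits = [temporal_iou(p, gt) >= 0.5 for p in ranked]

r_at_1 = recall_at_k(ranked, gt, k=1)          # top-1 prediction misses
r_at_5 = recall_at_k(ranked, gt, k=5)          # second prediction matches
ap = average_precision(hits, num_relevant=1)   # first hit at rank 2
```

In this toy case the correct moment is ranked second, so R@1 is 0 while R@5 is 1, and AP is 0.5. This shows why the two metrics are reported together: recall at small k rewards putting the right moment at the very top, while AP gives partial credit according to rank.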