The ability to condense videos and extract essential information has become increasingly important with the continuous rise of online video content. Video summarization methods strive to condense lengthy videos into concise representations while preserving diverse context and representative visual content. However, existing techniques often struggle to capture the complex temporal dependencies, key moments and contextual details within videos, resulting in suboptimal summarization performance. Moreover, existing methods place insufficient emphasis on selecting contextually relevant keyframes with diverse visual content. To address this limitation, this study proposes a Correlation Attention (CA) model specifically designed for semantic video summarization tasks. The proposed correlation attention strengthens semantic understanding by modeling inter-frame dependencies, allowing the summarizer to focus on frames that contribute most to the overall narrative. Integrating the Correlation Attention model with a stacked Bidirectional Long Short-Term Memory (BiLSTM) neural network architecture demonstrates substantial improvements in summarization performance compared to existing methods across diverse video datasets. Deep Video Summarization (DVS), TVSum and SumMe video datasets are used for training, validation and testing of the model. Quantitative analysis metrics such as accuracy (0.873), F-score (0.623) and average fidelity (0.968) show better performance of the model compared to the state of the art video summarization model. The findings underscore the efficacy of correlation model in enhancing semantic video summarization, promising advancements in video content analysis, retrieval, compression and applications requiring efficient video understanding. This research has potential benefits for a wide range of stakeholders, including web users, information seekers, content creators and educational institutions. • A Correlation Attention model for generating contextually meaningful video summaries. • A hybrid architecture that learn spatio-temporal features and long-term dependency. • Experiments on TVSum and SumMe datasets show superior video summary results.
K.C. et al. (Fri,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: