July 1, 2026 · Open Access

Language-guided frameworks for personalized video summarization

Key Points

This work aims to enhance video summarization by making it more personalized and reflective of user preferences using language models.
The authors propose Few-Shot Video SUMmarization (FS-VSUM), a non-trainable framework that conditions an LLM on a small number of annotated examples to create personalized summaries.
They introduce Self-Supervised Video SUMmarization (SS-VSUM), a trainable framework that formulates summarization as a semantic textual similarity task and incorporates user preferences through LLM prompts.
SS-VSUM includes a Preserving Diversity Loss (PDL) that dynamically regulates regularization based on linguistic diversity.
SS-VSUM achieved state-of-the-art performance on the SumMe dataset, surpassing existing methods in personalization.
FS-VSUM captures annotator-specific summarization styles without parameter updates, demonstrating the inherent capability of LLMs for controllable, personalized video summarization.
Both frameworks provide evidence for the effectiveness of language-guided approaches in video summarization.
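
The few-shot conditioning described in the key points can be illustrated with a minimal prompt-building sketch. This is not the authors' implementation: the function names, the prompt wording, and the assumption that each video segment is represented by a text caption are all hypothetical, and no LLM is called here.

```python
from typing import Sequence, Tuple

# Hypothetical representation: one annotated example is a pair of
# (segment captions, indices the annotator selected for the summary).
Example = Tuple[Sequence[str], Sequence[int]]


def format_segments(captions: Sequence[str]) -> str:
    """Render captions as an indexed list, one segment per line."""
    return "\n".join(f"[{i}] {c}" for i, c in enumerate(captions))


def build_few_shot_prompt(
    examples: Sequence[Example],
    target_captions: Sequence[str],
    budget: int,
) -> str:
    """Assemble a few-shot prompt from one annotator's past selections.

    The wording below is an assumption made for illustration only.
    """
    parts = ["Select video segments for a summary in the style of one annotator."]
    for n, (captions, selected) in enumerate(examples, start=1):
        chosen = ", ".join(str(i) for i in selected)
        parts.append(
            f"Example {n}\nSegments:\n{format_segments(captions)}\nSelected: {chosen}"
        )
    parts.append(
        f"Target video\nSegments:\n{format_segments(target_captions)}\n"
        f"Select at most {budget} segments.\nSelected:"
    )
    return "\n\n".join(parts)
```

The resulting string would be sent to an LLM of choice; because the annotator's style is carried entirely by the examples in the prompt, no model parameters are updated.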

Abstract

Existing video summarization methods predominantly produce generic summaries and often fail to reflect user-specific preferences. To address this limitation, we explore the potential of large language models (LLMs) for video summarization and propose two language-guided frameworks for personalized video summarization. We first propose Few-Shot Video SUMmarization (FS-VSUM), a non-trainable, example-driven framework that leverages LLM-based semantic reasoning to perform annotator-personalized video summarization. By conditioning on a small number of annotated examples, FS-VSUM captures annotator-specific summarization styles and generates customized summaries without parameter updates, demonstrating the inherent capability of LLMs for controllable and personalized video summarization. We then introduce Self-Supervised Video SUMmarization (SS-VSUM), a trainable framework that formulates video summarization as a semantic textual similarity task. SS-VSUM incorporates user preferences through LLM prompts and introduces a Preserving Diversity Loss (PDL) to dynamically regulate regularization based on linguistic diversity. We further extend SS-VSUM with additional analyses and clarifications, providing a more systematic understanding of language-guided video summarization. Experimental results show that SS-VSUM achieves state-of-the-art performance on the SumMe dataset. Together, this work provides a systematic investigation of language-guided video summarization, revealing how LLMs can support both training-free personalization and trainable performance optimization. The source code for the proposed frameworks is publicly available at https://github.com/sugitomoo/VSUM.
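
To make the semantic-textual-similarity formulation concrete, the sketch below ranks captioned segments by their similarity to a user preference written in natural language. It is only an illustration under stated assumptions: a bag-of-words cosine similarity stands in for whatever learned text representation SS-VSUM uses, the function names are hypothetical, and the Preserving Diversity Loss is not modeled because the abstract does not define it.

```python
import math
from collections import Counter
from typing import List, Sequence


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_segments(captions: Sequence[str], preference: str, k: int) -> List[int]:
    """Return the indices of the k captions most similar to the preference,
    restored to temporal order."""
    pref = bow(preference)
    scores = [cosine(bow(c), pref) for c in captions]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


captions = [
    "a dog runs on the beach",
    "a man cooks pasta",
    "the dog catches a ball",
]
print(select_segments(captions, "dog playing", k=2))  # prints [0, 2]
```

In a trainable setting, the similarity scores would come from a model optimized on the task rather than from fixed word counts; the selection step itself would look much the same.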
