Existing video summarization methods predominantly produce generic summaries and often fail to reflect user-specific preferences. To address this limitation, we explore the potential of large language models (LLMs) for video summarization and propose two language-guided frameworks for personalized video summarization. We first propose Few-Shot Video SUMmarization (FS-VSUM), a non-trainable, example-driven framework that leverages LLM-based semantic reasoning to perform annotator-personalized video summarization. By conditioning on a small number of annotated examples, FS-VSUM captures annotator-specific summarization styles and generates customized summaries without parameter updates, demonstrating the inherent capability of LLMs for controllable and personalized video summarization. We then introduce Self-Supervised Video SUMmarization (SS-VSUM), a trainable framework that formulates video summarization as a semantic textual similarity task. SS-VSUM incorporates user preferences through LLM prompts and introduces a Preserving Diversity Loss (PDL) that dynamically adjusts regularization based on linguistic diversity. We further extend SS-VSUM with additional analyses and clarifications. Experimental results show that SS-VSUM achieves state-of-the-art performance on the SumMe dataset. Together, this work provides a systematic investigation of language-guided video summarization, revealing how LLMs can support both training-free personalization and trainable performance optimization. The source code for the proposed frameworks is publicly available at https://github.com/sugitomoo/VSUM.
Sugihara et al. studied this question.