Video summarization aims to produce a short yet informative summary of a long video while reducing redundancy. Most transformer-based methods operate at a single temporal scale or ignore shot-level structure, which limits temporal coherence and cross-dataset generalization. To fill these gaps, we present HybridHiT-UNet, a supervised framework that combines three complementary components: a pretrained Vision Transformer encoder that provides spatially rich frame representations, a multi-scale 1D Temporal U-Net backbone that models those representations hierarchically over time, and a shot-aware hierarchical transformer scoring head that brings inter-shot context into importance prediction. Frame-level scores are summed into shot-level utilities, and shots are selected by knapsack optimization under a fixed-length budget; a weighted focal loss addresses the extreme class imbalance. Extensive experiments on four benchmarks (SumMe, TVSum, OVP, and YouTube) under canonical, augmented, and transfer protocols show that HybridHiT-UNet achieves F1-scores of 65.8% on SumMe and 79.92% on TVSum, higher than existing methods, while still achieving diversity scores of 64.98% and 48.68%, respectively. A systematic study further shows that a 20% summary budget yields a consistently better coverage-diversity trade-off than the traditional 15% budget, providing evidence-based guidance on the choice of summary length.
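The selection and loss steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the half-open shot boundaries, and the `alpha` and `gamma` defaults are assumptions chosen for illustration, and the focal loss is the standard class-weighted binary form rather than whatever variant the paper uses.

```python
import math

def shot_utilities(frame_scores, shot_bounds):
    """Sum frame-level scores inside each shot (half-open [start, end) ranges)."""
    return [sum(frame_scores[s:e]) for s, e in shot_bounds]

def knapsack_select(utilities, lengths, capacity):
    """0/1 knapsack by dynamic programming; returns sorted indices of chosen shots."""
    n = len(utilities)
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = lengths[i - 1], utilities[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # option 2: take it
    # Backtrack to recover which shots were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)

def summarize(frame_scores, shot_bounds, budget=0.15):
    """Pick shots whose total length fits within `budget` of the video length."""
    lengths = [e - s for s, e in shot_bounds]
    # Small epsilon guards against floating-point round-down of the budget.
    capacity = int(budget * len(frame_scores) + 1e-9)
    return knapsack_select(shot_utilities(frame_scores, shot_bounds), lengths, capacity)

def weighted_focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss for one frame; alpha and gamma are assumed values.

    p is the predicted importance probability, y is the 0/1 label.
    alpha up-weights the rare positive class; gamma down-weights easy examples.
    """
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

Calling `summarize(scores, shots, budget=0.20)` instead of the default switches between the two budgets the abstract compares; the knapsack treats shot length in frames as weight and summed score as value.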
Sakib et al. (Wed,) studied this question.