With the proliferation of multimedia data, controllable summary generation has become a key focus in Artificial Intelligence Content Generation. However, many traditional methods lack precise control over output length and often produce summaries that are either too verbose or too brief, failing to meet diverse user needs. In this paper, we propose a length-customizable approach for multimodal image-text summarization. Our method integrates combinatorial optimization with deep learning to address the length-control challenge. Specifically, we formulate the summarization task as a knapsack optimization problem and solve it with a greedy algorithm so that the output strictly adheres to user-defined length constraints. Additionally, we introduce a multimodal attention mechanism to ensure balanced and coherent integration of textual and visual information. To further enhance semantic alignment, we employ a cross-modal matching strategy for image selection based on pre-trained vision-language models. Experimental evaluations on the MSMO dataset, with comparisons against baselines such as LEAD-3, Seq2Seq, Attention, and Transformer, show that our method achieves a ROUGE-1 score of 40.52, ROUGE-2 of 16.07, and ROUGE-L of 35.15, outperforming existing length-controllable baselines. Moreover, our approach attains the lowest length variance, confirming its precise adherence to target summary lengths. These results validate the effectiveness of our method in generating high-quality, length-constrained multimodal summaries.
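To make the knapsack view of length control concrete, the following is a minimal sketch of greedy selection under a length budget. It is not the implementation from the paper: the function name `greedy_knapsack_select`, the per-sentence relevance scores, and the use of word count as the length measure are all assumptions made for illustration. Each candidate sentence is treated as an item with a value (its score) and a weight (its length), and items are added in order of value per unit length as long as the budget is not exceeded.

```python
from typing import List


def greedy_knapsack_select(
    sentences: List[str], scores: List[float], budget: int
) -> List[int]:
    """Hypothetical helper: choose sentence indices under a word budget.

    Assumptions (not from the paper): length is measured in
    whitespace-separated words, and `scores` are precomputed relevance
    values, one per sentence.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")

    lengths = [len(s.split()) for s in sentences]

    # Rank candidates by value density (score per word), highest first.
    # Empty sentences are skipped to avoid division by zero.
    order = sorted(
        (i for i in range(len(sentences)) if lengths[i] > 0),
        key=lambda i: scores[i] / lengths[i],
        reverse=True,
    )

    chosen: List[int] = []
    used = 0
    for i in order:
        # Add a sentence only if it still fits, so the budget is never exceeded.
        if used + lengths[i] <= budget:
            chosen.append(i)
            used += lengths[i]

    # Restore document order so the summary reads naturally.
    return sorted(chosen)


if __name__ == "__main__":
    sents = ["a b c d", "e f", "g h i", "j"]
    vals = [4.0, 3.0, 2.7, 0.5]
    picked = greedy_knapsack_select(sents, vals, budget=6)
    print(" ".join(sents[i] for i in picked))
```

A greedy pass of this kind guarantees that the length constraint holds, but in general it only approximates the optimal knapsack solution; an exact dynamic-programming solver could be substituted where the candidate set is small.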
Liu et al. (Thu,) studied this question.