What question did this study set out to answer?

The study evaluates whether large vision–language models (LVLMs) can perform zero-shot multimodal sentiment analysis well enough to serve as a triage signal in video content moderation.

May 20, 2026 · Open Access

Zero-Shot Multimodal Sentiment Analysis Using LVLMs as a Triage Signal for Video Platform Moderation

Key Points

Evaluated two LVLMs: LLaVA-OneVision-7B and Qwen2.5-VL-7B.
Analyzed inputs in three settings: text-only, vision-only, and multimodal.
Used a conversational TV-series dataset with short utterance-level video segments and transcripts.
LLaVA-OneVision-7B outperformed Qwen2.5-VL-7B, particularly in multimodal settings.
Multimodal sentiment inference provided useful screening signals without task-specific fine-tuning.
Findings highlighted trade-offs between frame sampling and image resolution.
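The trade-off between frame sampling and image resolution can be illustrated with a small sketch. It assumes a fixed visual-token budget per segment and roughly one token per square image patch. Both are simplifying assumptions made here for illustration, not details reported by the study. The function names, the patch size of 14 pixels, and the budget value are hypothetical.

```python
import math
from typing import List


def uniform_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices at the centers of equal-length spans of the clip."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    n = min(num_samples, total_frames)  # cannot sample more frames than exist
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def max_side_for_budget(num_frames: int, token_budget: int, patch: int = 14) -> int:
    """Largest square frame side (in pixels) that fits the token budget.

    Assumes one visual token per patch x patch tile (hypothetical; real
    models may merge or pool patches, which changes the arithmetic).
    """
    if num_frames <= 0 or token_budget <= 0:
        raise ValueError("num_frames and token_budget must be positive")
    tokens_per_frame = token_budget // num_frames
    side_in_patches = math.isqrt(tokens_per_frame)
    return side_in_patches * patch


if __name__ == "__main__":
    budget = 4096  # hypothetical visual-token budget per segment
    for frames in (4, 16):
        print(frames, "frames ->", max_side_for_budget(frames, budget), "px per side")
```

Under these assumptions, quadrupling the number of sampled frames halves the side length available to each frame, so temporal coverage is bought at the cost of spatial detail.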

Abstract

Children increasingly consume online video content, creating a growing need for scalable approaches to support content moderation workflows. However, directly identifying harmful or policy-violating content, such as violence, sexual content, or self-harm, remains a complex task that typically requires specialized classifiers and domain-specific annotations. In this context, sentiment analysis can provide complementary information by capturing affective signals expressed through language and visual cues. This study does not treat sentiment polarity as a direct indicator of unsafe or policy-violating content. Instead, it explores multimodal sentiment analysis as an auxiliary triage signal that may help prioritize content for human review or identify segments requiring further inspection. This paper investigates the feasibility of using large vision–language models (LVLMs) for zero-shot multimodal sentiment analysis on utterance-aligned video segments. We evaluate two LVLMs, LLaVA-OneVision-7B and Qwen2.5-VL-7B, under three input settings: text-only, vision-only, and multimodal, using a conversational TV-series dataset consisting of short utterance-level video segments and transcripts. The results show that multimodal sentiment inference can provide useful screening signals without task-specific fine-tuning, although the benefits are model-dependent. LLaVA-OneVision-7B consistently outperforms Qwen2.5-VL-7B and benefits more clearly from combining textual and visual inputs, whereas Qwen2.5-VL-7B shows limited improvement across modality settings. We also analyze the trade-off between frame sampling and image resolution. Finally, we discuss limitations related to dataset scope, annotation subjectivity, class imbalance, and the need for broader validation before real-world deployment.
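The three input settings described above can be sketched as a prompt-construction step followed by label parsing. This is a minimal illustration, not the authors' pipeline: the prompt wording, the three-class label set, and the names Segment, build_request, and parse_label are all assumptions introduced here, and the call to the LVLM itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed label set; the abstract does not list the sentiment classes.
LABELS = ("negative", "neutral", "positive")
SETTINGS = ("text", "vision", "multimodal")


@dataclass
class Segment:
    """One utterance-level video segment (hypothetical container)."""
    transcript: str
    frame_paths: List[str]


def build_request(segment: Segment, setting: str) -> Tuple[str, List[str]]:
    """Return (prompt, frames) for one of the three input settings."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting}")
    parts = ["Classify the sentiment expressed in this utterance."]
    if setting in ("text", "multimodal"):
        parts.append(f'Transcript: "{segment.transcript}"')
    parts.append("Answer with one word: " + ", ".join(LABELS) + ".")
    # Frames are passed only when the visual modality is enabled.
    frames = list(segment.frame_paths) if setting in ("vision", "multimodal") else []
    return "\n".join(parts), frames


def parse_label(reply: str) -> Optional[str]:
    """Map free-form model output to a label; None if absent or ambiguous."""
    text = reply.lower()
    found = [label for label in LABELS if label in text]
    return found[0] if len(found) == 1 else None
```

Returning None for ambiguous replies keeps unparseable outputs visible instead of silently assigning them to a class, which matters if the predictions are later used to prioritize segments for human review.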
