Children increasingly consume online video content, creating a growing need for scalable approaches to support content moderation workflows. However, directly identifying harmful or policy-violating content, such as violence, sexual content, or self-harm, remains a complex task that typically requires specialized classifiers and domain-specific annotations. In this context, sentiment analysis can provide complementary information by capturing affective signals expressed through language and visual cues. This study does not treat sentiment polarity as a direct indicator of unsafe or policy-violating content. Instead, it explores multimodal sentiment analysis as an auxiliary triage signal that may help prioritize content for human review or identify segments requiring further inspection. This paper investigates the feasibility of using large vision–language models (LVLMs) for zero-shot multimodal sentiment analysis on utterance-aligned video segments. We evaluate two LVLMs, LLaVA-OneVision-7B and Qwen2.5-VL-7B, under three input settings: text-only, vision-only, and multimodal, using a conversational TV-series dataset consisting of short utterance-level video segments and transcripts. The results show that multimodal sentiment inference can provide useful screening signals without task-specific fine-tuning, although the benefits are model-dependent. LLaVA-OneVision-7B consistently outperforms Qwen2.5-VL-7B and benefits more clearly from combining textual and visual inputs, whereas Qwen2.5-VL-7B shows limited improvement across modality settings. We also analyze the trade-off between frame sampling and image resolution. Finally, we discuss limitations related to dataset scope, annotation subjectivity, class imbalance, and the need for broader validation before real-world deployment.
Hanafiah et al. studied this question.
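As a rough illustration of the three input settings named in the abstract (text-only, vision-only, and multimodal), the following minimal sketch assembles a zero-shot prompt and a set of sampled frame indices for one utterance-aligned segment. It is not the authors' pipeline. The label set, the prompt wording, the default of 8 frames, and the names `ModelInput`, `sample_frame_indices`, and `build_input` are all assumptions introduced here for illustration, and no model is actually called.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label set: the abstract does not list the sentiment classes.
LABELS = ("negative", "neutral", "positive")
SETTINGS = ("text", "vision", "multimodal")


@dataclass
class ModelInput:
    setting: str              # one of SETTINGS
    prompt: str               # instruction text passed to the LVLM
    frame_indices: List[int]  # frames to decode; empty for text-only


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick indices spread evenly over a clip (midpoint of equal-width bins)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


def build_input(setting: str, transcript: str, total_frames: int,
                num_frames: int = 8) -> ModelInput:
    """Assemble the prompt and frame list for one utterance-level segment."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting: {setting!r}")
    use_text = setting in ("text", "multimodal")
    use_vision = setting in ("vision", "multimodal")

    parts = []
    if use_vision:
        parts.append("You are given frames from a short video segment.")
    if use_text:
        # The transcript is withheld in the vision-only setting.
        parts.append(f'Utterance: "{transcript}"')
    # Assumed prompt wording, not taken from the paper.
    parts.append("Classify the sentiment as one of: " + ", ".join(LABELS) + ".")

    frames = sample_frame_indices(total_frames, num_frames) if use_vision else []
    return ModelInput(setting=setting, prompt=" ".join(parts), frame_indices=frames)


if __name__ == "__main__":
    for s in SETTINGS:
        print(build_input(s, "I can't believe you did that.", total_frames=100))
```

In a setup like this, `num_frames` is the knob on the frame-sampling side of the trade-off the abstract mentions: raising it covers more of the segment, but under a fixed visual input budget it would leave less resolution per frame.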