April 25, 2026 | Open Access

GCR-TE: Grounded generative counterfactual rewriting for text enhanced multimodal sentiment analysis

Key Points

The study aims to improve multimodal sentiment analysis by mitigating shortcut learning and making the text channel more reliable.
It introduces GCR-TE, a framework that combines multimodal grounding with generative counterfactual rewriting.
Attribution-based consistency scores separate emotion-bearing text spans from suspicious co-occurrence spans.
The model is trained with three objectives: counterfactual invariance, anchor-focused attention gain, and cross-modal alignment regularization.
GCR-TE improves robustness against spurious correlations in sentiment analysis.
It also makes more effective use of sentiment-bearing words in multimodal contexts.
The text enhancement requires no additional human annotations.
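
To make the span-separation idea in the key points concrete, here is a minimal sketch in Python. It is not the authors' implementation: the per-token attribution values, the `consistency` rule (taking the minimum of two attributions), the thresholds, and the lookup-table "rewriter" are all assumptions chosen for illustration. The source only states that attribution-based consistency scores separate emotion-bearing spans from suspicious ones, and that rewriting touches only the suspicious spans.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    text_attr: float   # hypothetical attribution from a text-only pathway, in [0, 1]
    fused_attr: float  # hypothetical attribution from the fused audio-visual-text pathway


def consistency(tok: Token) -> float:
    """Assumed score: high only when both pathways agree the token matters."""
    return min(tok.text_attr, tok.fused_attr)


def split_tokens(tokens, anchor_thresh=0.5, suspicious_thresh=0.5):
    """Return (anchor indices, suspicious indices) under the assumed thresholds."""
    anchors, suspicious = [], []
    for i, tok in enumerate(tokens):
        if consistency(tok) >= anchor_thresh:
            anchors.append(i)          # supported by multimodal evidence
        elif tok.text_attr >= suspicious_thresh:
            suspicious.append(i)       # text relies on it, other modalities do not
    return anchors, suspicious


def rewrite(tokens, suspicious, substitutes):
    """Toy stand-in for a generative rewriter: swap only suspicious tokens."""
    suspicious = set(suspicious)
    return [
        substitutes.get(tok.text, tok.text) if i in suspicious else tok.text
        for i, tok in enumerate(tokens)
    ]


# Made-up attributions for "the Nolan movie was wonderful".
tokens = [
    Token("the", 0.05, 0.05),
    Token("Nolan", 0.80, 0.10),      # strong in text only: a possible shortcut
    Token("movie", 0.10, 0.10),
    Token("was", 0.05, 0.05),
    Token("wonderful", 0.90, 0.85),  # strong in both: a sentiment anchor
]
anchors, suspicious = split_tokens(tokens)
rewritten = rewrite(tokens, suspicious, {"Nolan": "new"})
```

In this toy case only "Nolan" is rewritten, while "wonderful" is kept as the anchored sentiment cue, so the label of the sentence is unchanged.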

Abstract

Real-world multimodal sentiment analysis often suffers from a paradoxical bottleneck: text is the dominant signal carrier, yet it is also the easiest channel to be hijacked by spurious co-occurrences that trigger shortcut learning and lead to severe out-of-distribution degradation. We propose GCR-TE, a text-enhancement framework for multimodal sentiment learning that combines multimodal grounding with generative counterfactual rewriting. First, GCR-TE builds sentiment evidence anchors from fused audio-visual-text representations and attribution-based consistency scores, separating emotion-bearing spans from suspicious co-occurrence spans in the transcript. Second, a grounded counterfactual rewriter generates a family of label-preserving textual variants by controllably substituting, paraphrasing, and style-perturbing only the suspicious spans while keeping the anchored sentiment cues and their semantic roles intact. Finally, we train the model with a triple objective: (i) counterfactual invariance to suppress shortcut features, (ii) anchor-focused attention gain to selectively amplify reliable sentiment cues, and (iii) cross-modal alignment regularization to ensure semantic consistency between augmented text and the original audio-visual evidence. Without requiring extra human annotations, GCR-TE simultaneously improves robustness against spurious correlations and enhances the effective utilization of sentiment-bearing words, offering a controllable and interpretable text-enhanced paradigm for multimodal sentiment analysis.
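
The triple objective described in the abstract can be sketched as a weighted sum of three auxiliary terms. The abstract names the three terms but not their functional forms, so everything below is an assumption made for illustration: squared prediction differences for invariance, negative log attention mass on anchors for the attention gain, cosine distance for alignment, the added task loss, and the weights `w_inv`, `w_anchor`, and `w_align`.

```python
import math


def invariance_loss(pred_orig, preds_cf):
    """Assumed form: mean squared gap between the prediction on the original
    text and the predictions on its counterfactual rewrites."""
    return sum((p - pred_orig) ** 2 for p in preds_cf) / len(preds_cf)


def anchor_gain_loss(attn, anchor_idx, eps=1e-8):
    """Assumed form: negative log of the attention mass placed on anchor tokens.
    `attn` is expected to be a distribution over tokens (sums to 1)."""
    mass = sum(attn[i] for i in anchor_idx)
    return -math.log(mass + eps)


def alignment_loss(text_emb, av_emb, eps=1e-8):
    """Assumed form: cosine distance between the augmented-text embedding
    and the original audio-visual embedding."""
    dot = sum(a * b for a, b in zip(text_emb, av_emb))
    norm = math.sqrt(sum(a * a for a in text_emb)) * math.sqrt(sum(b * b for b in av_emb))
    return 1.0 - dot / (norm + eps)


def total_loss(task_loss, pred_orig, preds_cf, attn, anchor_idx,
               text_emb, av_emb, w_inv=1.0, w_anchor=0.1, w_align=0.1):
    """Task loss plus the three weighted auxiliary terms (weights are hypothetical)."""
    return (
        task_loss
        + w_inv * invariance_loss(pred_orig, preds_cf)
        + w_anchor * anchor_gain_loss(attn, anchor_idx)
        + w_align * alignment_loss(text_emb, av_emb)
    )


# Toy usage: two rewrites, attention over five tokens, anchor at index 4.
loss = total_loss(
    task_loss=0.30,
    pred_orig=0.8,
    preds_cf=[0.7, 0.9],
    attn=[0.1, 0.1, 0.1, 0.1, 0.6],
    anchor_idx=[4],
    text_emb=[1.0, 0.0],
    av_emb=[0.8, 0.6],
)
```

Under these assumed forms, each auxiliary term approaches zero when its goal is met: predictions match across rewrites, attention concentrates on anchors, and the text and audio-visual embeddings point in the same direction.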
