Real-world multimodal sentiment analysis faces a paradoxical bottleneck: text carries the dominant signal, yet it is also the channel most easily hijacked by spurious co-occurrences, which trigger shortcut learning and cause severe out-of-distribution degradation. We propose GCR-TE, a text-enhancement framework for multimodal sentiment learning that combines multimodal grounding with generative counterfactual rewriting. First, GCR-TE builds sentiment evidence anchors from fused audio-visual-text representations and attribution-based consistency scores, separating emotion-bearing spans in the transcript from suspicious co-occurrence spans. Second, a grounded counterfactual rewriter generates a family of label-preserving textual variants by controllably substituting, paraphrasing, and style-perturbing only the suspicious spans, while keeping the anchored sentiment cues and their semantic roles intact. Finally, we train the model with a triple objective: (i) counterfactual invariance to suppress shortcut features, (ii) anchor-focused attention gain to selectively amplify reliable sentiment cues, and (iii) cross-modal alignment regularization to keep the augmented text semantically consistent with the original audio-visual evidence. Without requiring extra human annotations, GCR-TE improves robustness to spurious correlations and makes more effective use of sentiment-bearing words, offering a controllable and interpretable text-enhanced paradigm for multimodal sentiment analysis.
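As an illustration of how the three training terms could be combined, the following is a minimal sketch and not the authors' implementation. The abstract does not give the functional form of any term, so every choice here is an assumption: counterfactual invariance is modeled as a symmetric KL divergence between predictions on the original and rewritten transcripts, the anchor-focused attention gain as the negative log of the attention mass on anchored tokens, and cross-modal alignment as one minus the cosine similarity between the augmented-text and audio-visual embeddings. All function names, arguments, and weights are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def invariance_loss(p_orig, p_variants):
    """Assumed form of term (i): mean symmetric KL between the prediction on
    the original transcript and the prediction on each counterfactual variant."""
    if not p_variants:
        return 0.0
    total = 0.0
    for p_cf in p_variants:
        total += 0.5 * (kl_divergence(p_orig, p_cf) + kl_divergence(p_cf, p_orig))
    return total / len(p_variants)


def anchor_attention_loss(attention, anchor_mask, eps=1e-12):
    """Assumed form of term (ii): negative log of the attention mass that
    falls on anchored tokens. Lower loss means more attention on anchors."""
    mass = sum(a for a, m in zip(attention, anchor_mask) if m)
    return -math.log(mass + eps)


def alignment_loss(text_vec, av_vec, eps=1e-12):
    """Assumed form of term (iii): one minus the cosine similarity between
    the augmented-text embedding and the audio-visual embedding."""
    dot = sum(t * a for t, a in zip(text_vec, av_vec))
    norm = math.sqrt(sum(t * t for t in text_vec)) * math.sqrt(sum(a * a for a in av_vec))
    return 1.0 - dot / (norm + eps)


def gcr_te_objective(p_orig, p_variants, attention, anchor_mask,
                     text_vec, av_vec, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms. The weights are hypothetical
    hyperparameters; any supervised task loss would be added by the caller."""
    w_inv, w_anchor, w_align = weights
    return (w_inv * invariance_loss(p_orig, p_variants)
            + w_anchor * anchor_attention_loss(attention, anchor_mask)
            + w_align * alignment_loss(text_vec, av_vec))
```

Under these assumptions, a model whose prediction does not change across counterfactual variants contributes nothing to the first term, so the remaining loss comes only from how much attention lands on anchored tokens and how well the augmented text matches the audio-visual evidence.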
Ji et al. studied this question.