In an age of rapid progress in information science, human communication relies on the synergistic interaction of multiple modalities to express emotion and convey information. The spread of multimodal interaction methods has also stimulated scholarly interest in multimodal research, and multimodal metaphor research has emerged as a novel line of inquiry from the fusion of interdisciplinary and multimodal discourse research. This study addresses the lack of systematic evaluation of how large language models (LLMs) understand and generate multimodal metaphors by proposing a four-dimension progressive evaluation framework (FDPEF) grounded in cognitive linguistics theory and multimodal mechanisms. The results indicate that, in understanding, Claude-3-5 leads while Cici is the weakest because of over-abstraction; in generation, ChatGPT-4 demonstrates the best multimodal mapping logic, but none of the models completely avoids the “graphic semantic deviation” problem; in consistency, ChatGPT-4 comes close to the human-level metaphor comprehension threshold but still suffers from cognitive bias; and in creativity, LLMs generally rely on conventional metaphor paradigms, with their creativity limited by the cognitive framework inherent in the training data. The study shows that LLMs can improve metaphor parsing accuracy through joint visual-textual representation, and that metaphor parsing outcomes and their interpretive transformations can be quantified as measurable metrics. Metaphor generation, however, remains limited by path dependence and an insufficient understanding of cultural context; future work should improve metaphor controllability by combining multimodal embedding with interpretable AI techniques.
Zhong Yuke studied this question.