What question did this study set out to answer?

This research evaluates the performance of large language models in multimodal metaphor understanding and generation.

May 6, 2026 | Open Access

Performances of LLMs in Multimodal Metaphor Understanding, Generation, Consistency and Creativity Based on FDPEF

Key Points

This research evaluates the performance of large language models in multimodal metaphor understanding and generation.
The study develops a four-dimension progressive evaluation framework (FDPEF) grounded in cognitive linguistics and multimodal interaction.
The study assesses the understanding, generative ability, consistency, and creativity of selected LLMs.
The study evaluates the joint representation of visual and textual modalities in metaphor parsing.
Claude-3-5 showed the best understanding ability, while Cici lagged due to over-abstraction.
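In generation, ChatGPT-4 showed the strongest multimodal mapping logic, but no model fully avoided the “graphic semantic deviation” problem.
In consistency, ChatGPT-4 came close to the human-level metaphor comprehension threshold but still exhibited cognitive bias.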
LLMs' creativity is hindered by path dependence and reliance on conventional metaphors.

Abstract

In an age of rapid progress in information science, human communication relies on the synergistic interaction of multiple modalities to express emotion and convey information. The spread of multimodal interaction methods has drawn growing scholarly interest in multimodal research, and multimodal metaphor research has emerged as a new line of inquiry from the fusion of interdisciplinary and multimodal discourse research. This study addresses the lack of systematic evaluation of large language models (LLMs) in understanding and generating multimodal metaphors by proposing a four-dimension progressive evaluation framework (FDPEF) based on cognitive linguistics theory and multimodal mechanisms. The results cover four dimensions. In understanding, Claude-3-5 leads, while Cici is the weakest because of over-abstraction. In generation, ChatGPT-4 demonstrates the best multimodal mapping logic, but none of the models completely avoids the “graphic semantic deviation” problem. In consistency, ChatGPT-4 is close to the human-level metaphor comprehension threshold but still suffers from cognitive bias. In creativity, LLMs generally rely on the conventional metaphor paradigm, and their creativity is limited by the inherent cognitive framework of the training data. The study shows that LLMs can improve metaphor parsing accuracy through visual-textual joint representation, and that metaphor parsing outcomes and their interpretive transformations can be quantified as measurable metrics. Metaphor generation, however, is still limited by path dependence and insufficient understanding of cultural contexts; future work should optimize metaphor controllability by combining multimodal embedding and interpretable AI techniques.
