What question did this study set out to answer?

To assess how well different language models detect sarcasm across multiple languages, focusing on cultural influences.

February 28, 2026 · Open Access

Evaluating the capability of base and large-scale language models for multilingual sarcasm detection

Key Points

Used base-scale pre-trained models (BERT, RoBERTa, XLM-RoBERTa, and DistilBERT) and a large language model (GPT-4).
Fine-tuned the base-scale models on the sarcasm detection task and evaluated GPT-4 in few-shot settings.
Evaluated performance in English, Spanish, and Amharic, in both multilingual and monolingual settings.
RoBERTa-base achieved the highest multilingual generalization, with an F1 score of 0.82.
BERT performed best in English, with an F1 score of 0.90.
GPT-4 showed limitations in sarcasm comprehension, with a few-shot F1 score of 0.65, despite its stronger language interpretation capabilities.
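
The results above are reported as F1 scores. As a reminder of what that metric measures, here is a minimal sketch that computes binary F1 for the sarcastic class from gold and predicted labels. The function name, the label encoding (1 = sarcastic, 0 = not sarcastic), and the toy labels are assumptions made for illustration; the summary does not state which F1 variant (binary, macro, or weighted) the study reports.

```python
def f1_score(gold, predicted, positive=1):
    """Binary F1 for the positive (here: sarcastic) class.

    Hypothetical helper for illustration; not the study's evaluation code.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (invented, not data from the study).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
score = f1_score(gold, predicted)
print(round(score, 2))
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by labeling everything as sarcastic or by labeling nothing as sarcastic.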

Abstract

Although natural language understanding has made significant progress, language models still struggle to grasp sarcasm, a complex linguistic phenomenon shaped by cultural and contextual differences. The difficulty is even greater in multilingual settings. This study assesses the efficacy of base-scale pre-trained models (Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pretraining Approach (RoBERTa), Unsupervised Cross-lingual Representation Learning at Scale (XLM-RoBERTa), and Distilled version of BERT (DistilBERT)) via task-specific fine-tuning, and of large language models (LLMs) (GPT-4) in few-shot contexts, across three languages: English, Spanish, and Amharic. While we primarily focus on multilingual sarcasm detection, we also offer monolingual benchmarks to evaluate language-specific adaptations. Among the fine-tuned models, RoBERTa-base achieves the highest multilingual generalization (F1: 0.82), while BERT performs best in English (F1: 0.90), indicating how well these models adapt to English. In contrast, GPT-4o with a few-shot strategy shows limited sarcasm comprehension (F1: 0.65), even though it is better at interpreting language. This indicates that, although LLMs exhibit greater flexibility, base-scale models fine-tuned on task-specific data remain superior at detecting multilingual sarcasm. Finally, we believe this work offers practical guidance for choosing a model when resources are limited and shows the importance of sarcasm detection systems that can adapt to different cultures.
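
To make the few-shot setting mentioned in the abstract concrete, the sketch below assembles a classification prompt from a handful of labeled examples followed by the text to classify. Everything in it is an assumption made for illustration: the function name, the instruction wording, the label strings, and the example sentences are hypothetical and are not the prompt or data used in the study. It builds the prompt string only and does not call any model.

```python
def build_few_shot_prompt(examples, text):
    """Assemble a few-shot sarcasm-classification prompt.

    `examples` is a list of (text, label) pairs shown to the model
    before the unlabeled `text`. Hypothetical helper for illustration.
    """
    lines = [
        "Decide whether each text is sarcastic. "
        "Answer 'sarcastic' or 'not sarcastic'.",
        "",
    ]
    for example_text, label in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final item is left unlabeled for the model to complete.
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


# Invented demonstration examples (not from the study's datasets).
examples = [
    ("Oh great, another Monday. Just what I needed.", "sarcastic"),
    ("The train leaves at 9 am tomorrow.", "not sarcastic"),
]
prompt = build_few_shot_prompt(
    examples, "Wow, I love waiting in line for two hours."
)
print(prompt)
```

In contrast to fine-tuning, this approach leaves the model's weights unchanged; the only task-specific signal is the handful of demonstrations placed in the prompt.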


Authors

Girma Yohannis Bade

Olga Kolesnikova

José Luis Oropeza

Journals

PeerJ Computer Science
