Corpus pragmatics faces ongoing challenges in the quantitative study of context-dependent categories such as humor, given their subjectivity and the need for costly inter-rater reliability checks. Recent advances in large language models (LLMs) offer a potential way to streamline pragmatic annotation tasks. This paper investigates that potential through an analysis of Italian political discourse on X, focusing on humorous tweets and their discursive functions (Attardo, 2020). We compare the performance of GPT-4o, LLaMA-3.3-70B-Instruct, and a novice annotator against that of an expert annotator. For humor detection, both models reached high agreement with the expert annotator (in particular, GPT-4o: Cohen’s κ = 0.75; AC1 = 0.87). By contrast, agreement dropped for the classification of humor functions (GPT-4o: Cohen’s κ = 0.37; AC1 = 0.70). Qualitative results suggest that the models rely heavily on lexical cues rather than demonstrating deeper pragmatic competence. These findings indicate that while LLMs can provide useful assistance in the initial stages of large-scale annotation, they remain limited in capturing the nuanced and context-dependent nature of pragmatic functions.
Bianco et al. studied this question.
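The agreement figures above pair Cohen's kappa with AC1 (Gwet's first-order agreement coefficient). The text does not describe how the authors computed them, so the following is only a minimal sketch using the standard two-rater definitions. The function names `cohen_kappa` and `gwet_ac1` and the toy labels are assumptions made for illustration and are not taken from the study's data.

```python
from collections import Counter


def _observed_agreement(a, b):
    """Share of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators (hypothetical helper).

    Chance agreement is the sum over labels of the product of the
    two annotators' marginal proportions. Undefined if that sum is 1.
    """
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_o = _observed_agreement(a, b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in labels)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac1(a, b):
    """Gwet's AC1 for two annotators (hypothetical helper).

    Chance agreement is sum(pi_k * (1 - pi_k)) / (Q - 1), where pi_k is
    the mean marginal proportion of label k and Q the number of labels.
    Requires at least two distinct labels.
    """
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    q = len(labels)
    p_o = _observed_agreement(a, b)
    pi = {k: (count_a[k] + count_b[k]) / (2 * n) for k in labels}
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels (NOT the study's data): "H" = humorous, "N" = not.
# The distribution is deliberately skewed toward "N".
expert = ["H"] * 2 + ["N"] * 18
model = ["H", "N"] + ["N"] * 17 + ["H"]

print(f"kappa = {cohen_kappa(expert, model):.2f}")
print(f"AC1   = {gwet_ac1(expert, model):.2f}")
```

On this invented sample the two annotators agree on 18 of 20 items, yet kappa comes out near 0.44 while AC1 is near 0.88. When one label dominates, kappa's chance-agreement term grows large and pulls the coefficient down, whereas AC1's chance term shrinks. This may be one reason the two coefficients diverge in results like those reported above, although the source does not state this explicitly.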