What does this research mean for the field?

Large language models demonstrate strong reliability in complex hermeneutic text annotation tasks, but their overall accuracy is moderate and they struggle with the same ambiguous texts as human annotators. Their self-reported confidence correlates positively with annotation accuracy. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

The research aims to determine if large language models can effectively interpret and annotate complex texts regarding depression from online forums.

May 17, 2026 · Open Access

Evaluating Large Language Models on a Hermeneutically Complex Text Annotation Task

Key Points

Evaluated GPT-4o mini and Llama 3.3 70B models using zero-shot and few-shot prompting strategies.
Utilized a corpus annotated by human annotators to assess challenges in text interpretation.
Assessed the correlation between model confidence and classification accuracy.
Both models showed moderate overall accuracy but high reliability across different comparisons.
Easier texts for human annotators corresponded to better LLM performance.
Higher model confidence frequently aligned with greater annotation accuracy.
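
The relationship between model confidence and annotation accuracy described above can be checked with a simple correlation between a confidence score and a correct/incorrect flag. The following is a minimal sketch, not the study's method: the summary does not say which statistic the authors used, so the point-biserial correlation, the function name `point_biserial`, and the assumption that confidence is a number between 0 and 1 are all illustrative choices.

```python
from statistics import mean, pstdev


def point_biserial(confidences, correct):
    """Pearson correlation between a continuous score and a 0/1 outcome.

    confidences: model-reported confidence per text (assumed numeric).
    correct: True/False per text, i.e. whether the model label matched
             the human reference label (hypothetical input format).
    """
    if len(confidences) != len(correct) or len(confidences) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    xs = [float(c) for c in confidences]
    ys = [1.0 if ok else 0.0 for ok in correct]
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    # Population covariance divided by the product of population std devs.
    covariance = mean([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)])
    return covariance / (sd_x * sd_y)


# Toy data, invented for illustration only (not results from the paper).
toy_confidences = [0.9, 0.8, 0.3, 0.2]
toy_correct = [True, True, False, False]
r = point_biserial(toy_confidences, toy_correct)
```

A value of `r` near 1 would mean that confident predictions tend to be the correct ones. A value near 0 would mean the confidence score carries little information about correctness.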

Abstract

The study examines whether large language models (LLMs) are capable of interpreting individual framings of depression in online forums, a challenging annotation task involving categories that are neither mutually exclusive nor trivial from a hermeneutic perspective. Using a corpus annotated by two annotators, the texts and categories that posed difficulties for human interpretation were identified, and the LLMs were prompted to report their confidence in their classifications. The study addresses the following research questions: whether LLMs can perform complex, meaning-making tasks comparable to human interpretation; how model and prompt choices affect accuracy and reliability; the extent to which model confidence in its own classifications correlates with classification accuracy; whether LLMs and humans struggle with the same texts; and how these findings can inform human–LLM collaboration in the annotation process. Two models, GPT-4o mini (closed-source) and Llama 3.3 70B (open-source), were evaluated using zero-shot and few-shot prompting strategies. While both models showed strong reliability in terms of run-to-run, model-to-model and prompt-to-prompt comparisons, their overall accuracy was moderate. Performance was highest on texts that were also easier for human annotators. Notably, higher model confidence was often associated with greater accuracy, suggesting that confidence scores may be useful in supporting collaborative annotation systems. The findings highlight both the potential and the interpretive limits of LLMs in social science applications, underscoring the importance of careful annotation design and the thoughtful integration of LLMs into human-centered workflows.
• Evaluated LLMs on a complex depression annotation task
• Few-shot prompting yielded only marginal improvements over zero-shot
• The LLM also struggled to label posts that were challenging for human annotators
• Model confidence correlated positively with annotation accuracy
• Model confidence could be used in supporting collaborative annotation systems
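
One way confidence scores might support a collaborative annotation workflow is to accept high-confidence model labels and send the rest to a human annotator. The sketch below illustrates that idea under stated assumptions. The `Annotation` type, the `route_by_confidence` function, the example label names, and the 0.7 threshold are all hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    text_id: str
    label: str          # model-assigned category (hypothetical label names)
    confidence: float   # model-reported confidence, assumed to lie in [0, 1]


def route_by_confidence(annotations, threshold=0.7):
    """Split model annotations into auto-accepted and human-review queues.

    The threshold is an illustrative default; in practice it would be
    tuned against a human-annotated validation set.
    """
    accepted, needs_review = [], []
    for annotation in annotations:
        if annotation.confidence >= threshold:
            accepted.append(annotation)
        else:
            needs_review.append(annotation)
    return accepted, needs_review


# Toy inputs, invented for illustration only.
batch = [
    Annotation("post-1", "category_a", 0.92),
    Annotation("post-2", "category_b", 0.41),
    Annotation("post-3", "category_a", 0.70),
]
accepted, needs_review = route_by_confidence(batch)
```

Because the study reports only moderate overall accuracy, a design like this would still need spot checks on the accepted queue rather than treating high confidence as a guarantee of correctness.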
