The study examines whether large language models (LLMs) are capable of interpreting individual framings of depression in online forums, a challenging annotation task involving categories that are neither mutually exclusive nor trivial from a hermeneutic perspective. Using a corpus labeled by two human annotators, the study identified the texts and categories that posed difficulties for human interpretation, and prompted the LLMs to report their confidence in their classifications. The study addresses the following research questions: whether LLMs can perform complex, meaning-making tasks at a level comparable to human interpretation; how model and prompt choices affect accuracy and reliability; the extent to which a model's confidence in its own classifications correlates with classification accuracy; whether LLMs and humans struggle with the same texts; and how these findings can inform human–LLM collaboration in the annotation process. Two models, GPT-4o mini (closed-source) and Llama 3.3 70B (open-source), were evaluated using zero-shot and few-shot prompting strategies. While both models showed strong reliability in run-to-run, model-to-model, and prompt-to-prompt comparisons, their overall accuracy was moderate. Performance was highest on texts that were also easier for human annotators. Notably, higher model confidence was often associated with greater accuracy, suggesting that confidence scores may be useful in supporting collaborative annotation systems. The findings highlight both the potential and the interpretive limits of LLMs in social science applications, underscoring the importance of careful annotation design and the thoughtful integration of LLMs into human-centered workflows.
• Evaluated LLMs on a complex depression annotation task
• Few-shot prompting yielded only marginal improvements over zero-shot
• The LLM also struggled to label posts that were challenging for human annotators
• Model confidence correlated positively with annotation accuracy
• Model confidence could be used in supporting collaborative annotation systems
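The link between model confidence and accuracy noted above can be illustrated with a minimal sketch. It assumes each model classification comes with a self-reported confidence in [0, 1] and a flag for whether it matched the human label. The records, the 0.7 threshold, and the function names are hypothetical choices for illustration. They are not data or procedures from the study.

```python
from statistics import mean

# Hypothetical records: (self-reported model confidence, label matched human label).
# Illustrative values only, not results from the study.
records = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]


def _accuracy(flags):
    """Share of correct classifications, or None for an empty group."""
    return mean(1.0 if ok else 0.0 for ok in flags) if flags else None


def accuracy_by_confidence(records, threshold):
    """Compare accuracy above and below a confidence threshold."""
    high = [ok for conf, ok in records if conf >= threshold]
    low = [ok for conf, ok in records if conf < threshold]
    return _accuracy(high), _accuracy(low)


def route_to_human(records, threshold):
    """Indices of low-confidence items that a human annotator would review."""
    return [i for i, (conf, _) in enumerate(records) if conf < threshold]


# The threshold of 0.7 is an arbitrary assumption for this sketch.
high_acc, low_acc = accuracy_by_confidence(records, threshold=0.7)
to_review = route_to_human(records, threshold=0.7)
print(high_acc, low_acc, to_review)
```

If accuracy in the high-confidence group is clearly higher than in the low-confidence group, as in this toy data, routing only the low-confidence items to human annotators is one plausible way to divide the work in a collaborative setup.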
Fodor et al. (Fri,) studied this question.