Abstract
Identifying disinformation in online social media is important for addressing several risks to society. However, methods for automatically detecting possible disinformation still mostly rely on low-level lexical cues, which are increasingly outdated in a world with access to generative language models. We present a new approach to the automatic detection of disinformation based on measuring discourse ‘derailment’, that is, messages that try to force the discourse away from one topic and onto another. While this may include both malicious and benign derailment, it could serve as an important signal for early-warning systems. In this study, we implement a system for the automatic detection of discourse derailment, which uses a Large Language Model (LLM) to generate expected replies and compares them to the real replies. We test this system on a set of human-annotated data and show that it outperforms various baselines and approaches the level of agreement between human annotators. This suggests that LLMs can be sensitive to discourse-level information. However, we also find evidence of several limitations, including that the automatic system relies on different cues than human annotators do, which introduces some bias. Nevertheless, our project represents a considerable step towards understanding how to use LLMs to analyse discourse, and offers a new angle on tackling disinformation.
Krykoniuk et al. (Fri,) studied this question.
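To make the core idea of the abstract concrete, the comparison between expected and real replies can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the system described in the study: the expected replies are passed in as plain strings (standing in for LLM output, which is not generated here), the comparison uses a simple bag-of-words cosine similarity chosen purely to keep the example self-contained, and the names `derailment_score`, `is_derailing`, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def derailment_score(expected_replies, real_reply):
    """Return 1 minus the best similarity between the real reply and any expected reply.

    Assumption: a reply that resembles none of the expected replies is a
    candidate for derailment. The similarity measure is a stand-in.
    """
    real = Counter(tokenize(real_reply))
    best = max(
        (cosine_similarity(Counter(tokenize(e)), real) for e in expected_replies),
        default=0.0,
    )
    return 1.0 - best


def is_derailing(expected_replies, real_reply, threshold=0.8):
    """Flag a reply whose score exceeds a hypothetical, untuned threshold."""
    return derailment_score(expected_replies, real_reply) > threshold


# Hypothetical expected replies, standing in for LLM generations.
expected = [
    "The new bus routes should cut commute times downtown.",
    "I hope the bus schedule changes help commuters.",
]
on_topic = "These bus routes will really help my commute."
off_topic = "Forget that, have you seen what the election officials are hiding?"

print(derailment_score(expected, on_topic))
print(derailment_score(expected, off_topic))
```

A lexical overlap measure is used here only for simplicity. Since the abstract argues that low-level lexical cues are increasingly outdated, a realistic comparison would presumably rely on a richer representation of the replies, and the threshold would need to be tuned against annotated data.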