March 3, 2025

Mechanisms of emergence and suppression of factual distortions in autoregressive language models

Key Points

Hallucination effects can significantly distort factual information generated by autoregressive language models, reducing their reliability.
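The study found that the first eigenvector of the spectral decomposition of the difference between final-layer covariance matrices is the most stable discriminative feature for detecting hallucinations.
The approach models an autoregressive network and relates its parameters to the distributions of generated tokens, so that factual distortions can be identified and reduced by manipulating activations along that eigenvector.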
Integrating these findings into language models may improve semantic control, opening new avenues for reliable text generation.

Abstract

Modern information-analysis systems that operate within the "human-information" paradigm inevitably run into limits imposed by the cognitive constraints of an analyst's memory. Integrating generative language models based on the Transformer architecture is a significant step toward further automating information processing. However, generative confabulation (hallucination) and the limited context window of these models can distort factual information and reduce the reliability of generated output. This article investigates the nature of hallucination effects in autoregressive language models and seeks robust, informative features for detecting, and potentially regulating, factual distortions in the latent space of neural networks. A model of an autoregressive neural network is developed that accounts for knowledge obsolescence and the superposition of multilayer perceptron parameters, which makes it possible to analyze the relationship between model parameters and the distributions of generated tokens. The first eigenvector of the spectral decomposition of the difference between covariance matrices of the final layer is found to be the most stable discriminative feature of hallucination. Manipulating activations along this direction is shown to reduce factual distortions and to control the semantics of the output text. The feature can be integrated into tools for monitoring and controlling text generation so that factual distortions are detected and corrected automatically during inference. It could potentially be extended to arbitrary abstract concepts, paving the way for more flexible and reliable semantic control of large language models.
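
The eigenvector feature described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation. It assumes that the two covariance matrices are computed from final-layer activations collected on factual and on hallucinated outputs, and that the "first" eigenvector is the one whose eigenvalue has the largest magnitude. The function names (hallucination_direction, steer), the planted direction u, and the Gaussian data are all hypothetical and exist only for illustration.

```python
import numpy as np


def hallucination_direction(acts_factual, acts_halluc):
    """Return the leading eigenvector of (cov_halluc - cov_factual).

    Both inputs are arrays of shape (n_samples, hidden_dim).
    Assumption: "first eigenvector" means largest |eigenvalue|.
    """
    cov_f = np.cov(acts_factual, rowvar=False)
    cov_h = np.cov(acts_halluc, rowvar=False)
    diff = cov_h - cov_f  # symmetric, so eigh applies
    eigvals, eigvecs = np.linalg.eigh(diff)
    idx = int(np.argmax(np.abs(eigvals)))
    v = eigvecs[:, idx]
    return v / np.linalg.norm(v), eigvals[idx]


def steer(acts, v, scale):
    """Rescale the component of each activation along unit vector v.

    scale=1.0 leaves activations unchanged; scale=0.0 removes the component.
    """
    proj = acts @ v  # projection coefficient per sample, shape (n_samples,)
    return acts + (scale - 1.0) * np.outer(proj, v)


# Synthetic stand-in for real activations (hypothetical data).
rng = np.random.default_rng(0)
dim, n = 8, 2000
u = np.zeros(dim)
u[2] = 1.0  # planted direction with extra variance in "hallucinated" samples

acts_factual = rng.normal(size=(n, dim))
acts_halluc = rng.normal(size=(n, dim)) + 3.0 * rng.normal(size=(n, 1)) * u

v, top_eigval = hallucination_direction(acts_factual, acts_halluc)

# Detection feature: variance of the projection onto v in each group.
var_factual = float(np.var(acts_factual @ v))
var_halluc = float(np.var(acts_halluc @ v))

# Intervention: remove the component along v from the hallucinated activations.
steered = steer(acts_halluc, v, scale=0.0)
```

In this toy setting the recovered direction v aligns with the planted direction u up to sign, and the projection variance separates the two groups. With a real model, the activations would come from the network's final layer and the steering step would be applied during inference; how the two groups are labeled and which layer output is used are choices this sketch leaves open.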
