Modern information-analytical processing systems operating within the "human-information" paradigm inevitably run into limits imposed by the cognitive constraints of an analyst's memory. Integrating generative language models based on the Transformer architecture is a significant step toward automating information processing. However, generative confabulation (hallucination) and the limited size of the context window inherent in these models can distort factual information and so reduce the reliability of generated outputs. The objective of this article is to investigate the nature of hallucination effects in autoregressive language models and to identify robust informative features for detecting, and potentially regulating, factual distortions in the latent space of neural networks. A model of an autoregressive neural network has been developed that accounts for knowledge obsolescence and the superposition of multilayer perceptron parameters, enabling analysis of the relationship between model parameters and the distributions of generated tokens. It has been found that the first eigenvector of the spectral decomposition of the difference between the covariance matrices of the final layer serves as the most stable discriminative feature of hallucination. It has been demonstrated that manipulating activations along this direction can reduce factual distortions and control the semantics of the output text. The developed feature can be integrated into tools for monitoring and controlling text generation in order to detect and correct factual distortions automatically during inference. The approach could potentially be extended to arbitrary abstract concepts, paving the way for more flexible and reliable semantic control of large language models.
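The covariance-difference feature described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes two sets of final-layer activations, one labelled factual and one labelled hallucinated, simulated here with synthetic Gaussian data. It takes the "first" eigenvector to be the one with the largest absolute eigenvalue, and it models "manipulating activations" as removing a fraction `alpha` of the projection onto that direction. The names `hallucination_direction`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np


def hallucination_direction(factual, hallucinated):
    """Leading eigenvector of cov(hallucinated) - cov(factual).

    Both inputs have shape (n_samples, hidden_dim). Assumption: "first"
    means the eigenvector with the largest absolute eigenvalue.
    """
    diff = np.cov(hallucinated, rowvar=False) - np.cov(factual, rowvar=False)
    # The difference of two covariance matrices is symmetric, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(diff)
    return eigvecs[:, np.argmax(np.abs(eigvals))]  # unit-norm column


def steer(activations, direction, alpha):
    """Remove a fraction alpha of each activation's component along direction."""
    projection = activations @ direction
    return activations - alpha * np.outer(projection, direction)


# Synthetic stand-in for real activations (assumption): the hallucinated
# set has inflated variance along coordinate 0.
rng = np.random.default_rng(0)
hidden_dim = 8
factual = rng.normal(size=(500, hidden_dim))
hallucinated = rng.normal(size=(500, hidden_dim))
hallucinated[:, 0] *= 3.0

direction = hallucination_direction(factual, hallucinated)

# Detection feature: spread of the projection onto the direction.
factual_spread = np.var(factual @ direction)
hallucinated_spread = np.var(hallucinated @ direction)

# Steering: alpha = 1.0 removes the component along the direction entirely.
steered = steer(hallucinated, direction, alpha=1.0)
```

In this toy setting the recovered direction aligns with the coordinate whose variance was inflated, and the projection spread separates the two sets. With real model activations, the labelling of examples and the choice of `alpha` would need to follow the procedure in the article itself.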
Vladislav Ivanov (Mon,) studied this question.