What does this research mean for the field?

The paper proposes that the Directional Similarity Gradient could dynamically disentangle polysemantic dimensions in neural latent spaces by classifying the latent vocabulary into semantic 'Attractors' and 'Repulsors'. Novelty: novel finding. Consensus alignment: establishes a new direction.

What question did this study set out to answer?

This research aims to redefine how we interpret latent meanings in AI models by introducing new metrics that account for polysemanticity.

March 10, 2026 · Open Access

Beyond Observation: Defining Latent Meaning via Directional Similarity Gradients

Key Points

Introduces the Directional Similarity Gradient, on theoretical grounds, as a new relational metric.
Replaces traditional Euclidean distance with variations in Cosine Similarity.
Classifies the latent vocabulary into semantic 'Attractors' and 'Repulsors'.
Explores implications for AI Alignment and structural deception.
Suggests that latent dimensions can be understood as a potential for movement rather than as static labels.
Proposes the Danger Index (I_{danger}) to probe the propensity for deception before it appears in AI outputs.
Outlines a framework for analyzing the geometric structure of AI models.

Abstract

Traditional Mechanistic Interpretability faces the geometric hurdle of polysemanticity and Superposition: the idea of assigning a static semantic label to individual dimensions of the Latent Space has proven inadequate. In this paper, we propose a working hypothesis to overcome the observational paradigm in favor of an interventionist and causal approach. We theoretically introduce the Directional Similarity Gradient, a relational metric that conceptualizes neural meaning not as a fixed coordinate, but as the rate of change of the local similarity profile following a targeted perturbation along a specific axis of the Residual Stream. We hypothesize that by replacing Euclidean distance with variations in Cosine Similarity, it is possible to isolate the directional component of meaning, classifying the latent vocabulary into semantic "Attractors" and "Repulsors". This approach would offer a potential tool for dynamically disentangling polysemantic dimensions depending on the original context vector. Finally, we explore the implications of this framework for AI Alignment. By utilizing Contrastive Extraction to isolate the latent directions of specific tasks and malicious concepts, we theorize a Danger Index (I_{danger}) that could probe the propensity for structural Deception even before it manifests in the output. Although requiring rigorous empirical validation in the field, this theoretical approach suggests that the meaning of individual dimensions of the Latent Space is intrinsically a potential for movement, outlining a new perspective for the geometric auditing of frontier models.
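
The abstract gives no formula for the Directional Similarity Gradient, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes the gradient can be approximated by a central finite difference: nudge a context vector by a small step along one axis and measure how its cosine similarity to each vocabulary vector changes. The function names (`cosine`, `similarity_gradient`, `classify`), the step size `eps`, the tolerance `tol`, and the toy two-dimensional vocabulary are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_gradient(context, axis, vocab, eps=1e-3):
    """Assumed reading of the metric: central finite difference of the
    cosine similarity to each vocabulary vector when `context` is
    perturbed by +/- eps along dimension `axis`."""
    plus = list(context)
    minus = list(context)
    plus[axis] += eps
    minus[axis] -= eps
    return {
        token: (cosine(plus, vec) - cosine(minus, vec)) / (2.0 * eps)
        for token, vec in vocab.items()
    }


def classify(gradients, tol=1e-6):
    """Split tokens by sign of the gradient: positive means the perturbation
    moves the context toward the token (attractor), negative means away
    from it (repulsor). Tokens within `tol` of zero are left unclassified."""
    attractors = sorted(t for t, g in gradients.items() if g > tol)
    repulsors = sorted(t for t, g in gradients.items() if g < -tol)
    return attractors, repulsors


# Toy vocabulary in a 2-dimensional "latent space" (illustrative values only).
vocab = {"up": [1.0, 1.0], "down": [1.0, -1.0], "flat": [1.0, 0.0]}

# Perturb axis 1 starting from two different context vectors.
for context in ([1.0, 0.0], [1.0, 2.0]):
    attractors, repulsors = classify(similarity_gradient(context, 1, vocab))
    print(context, "attractors:", attractors, "repulsors:", repulsors)
```

In this toy setup, perturbing axis 1 from the context `[1.0, 0.0]` makes "up" an attractor and "down" a repulsor, while from the context `[1.0, 2.0]` the same axis pushes the context away from "up". That is the kind of context dependence the abstract describes: under this reading, the same dimension has no fixed label, only a direction of movement relative to where the context already sits.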
