Abstract

Traditional mechanistic interpretability faces the geometric hurdle of polysemanticity and superposition: assigning a static semantic label to individual dimensions of the latent space has proven inadequate. In this paper, we propose a working hypothesis that moves beyond the observational paradigm toward an interventionist, causal approach. We introduce, at a theoretical level, the Directional Similarity Gradient, a relational metric that treats neural meaning not as a fixed coordinate but as the rate of change of the local similarity profile following a targeted perturbation along a specific axis of the residual stream. We hypothesize that measuring variations in cosine similarity, rather than Euclidean distance, makes it possible to isolate the directional component of meaning and to classify the latent vocabulary into semantic "Attractors" and "Repulsors". Because this classification depends on the original context vector, the approach would offer a potential tool for dynamically disentangling polysemantic dimensions. Finally, we explore the implications of this framework for AI alignment. Using contrastive extraction to isolate the latent directions of specific tasks and malicious concepts, we theorize a Danger Index (I₃₀₍₆₄ₑ) that could probe the propensity for structural deception before it manifests in the output. Although it still requires rigorous empirical validation, this theoretical approach suggests that the meaning of an individual dimension of the latent space is intrinsically a potential for movement, outlining a new perspective for the geometric auditing of frontier models.
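
As an illustration of the idea described above, the following is a minimal sketch of how a Directional Similarity Gradient could be estimated numerically. It is not the author's formal definition: it assumes that the "local similarity profile" is the vector of cosine similarities between a context vector `h` and a set of vocabulary embeddings `E`, that the perturbation is a small step `eps` along a single coordinate axis, and that "Attractors" and "Repulsors" are the tokens whose similarity respectively rises or falls under that step. The function names, the finite-difference scheme, and the tolerance `tol` are all hypothetical choices made for this example.

```python
import numpy as np


def cosine_profile(h, E):
    """Cosine similarity between a context vector h (d,) and each row of E (V, d)."""
    h_unit = h / np.linalg.norm(h)
    E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E_unit @ h_unit


def directional_similarity_gradient(h, E, axis, eps=1e-3):
    """Forward finite-difference estimate of the change in the similarity
    profile when h is nudged by eps along one coordinate axis.

    Assumption: this is one plausible reading of the metric, not a
    definition taken from the paper.
    """
    h = np.asarray(h, dtype=float)
    step = np.zeros_like(h)
    step[axis] = eps
    return (cosine_profile(h + step, E) - cosine_profile(h, E)) / eps


def classify(dsg, tol=1e-6):
    """Split vocabulary indices into attractors (similarity rises) and
    repulsors (similarity falls); entries within tol of zero are ignored."""
    attractors = np.flatnonzero(dsg > tol)
    repulsors = np.flatnonzero(dsg < -tol)
    return attractors, repulsors


# Toy vocabulary of three 2-dimensional embeddings (illustrative values only).
E_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# The same axis is probed from two different context vectors.
for h_toy in (np.array([1.0, 1.0]), np.array([1.0, -1.0])):
    dsg = directional_similarity_gradient(h_toy, E_toy, axis=0)
    print(h_toy, classify(dsg))
```

In this toy setting, the token at index 1 is a repulsor for the first context vector and an attractor for the second, although the perturbed axis is the same. This is the kind of context dependence the hypothesis relies on when it speaks of disentangling a polysemantic dimension according to the original context vector.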
Roberto Matarazzo (Sun,) studied this question.