What question did this study set out to answer?

The study aims to identify and characterize Degenerate Equilibrium (DegEq) in a hyperbolic distillation context using the HyDRA model.

April 12, 2026 · Open Access

HyDRA: Hyperbolic Distillation with Riemannian Adaptation

Key Points

Uses a compact autoregressive language model (≈12M parameters) that operates on the Lorentz hyperboloid H¹²⁸.
Trains it by knowledge distillation from a frozen GPT-2-small teacher (117M parameters).
Runs five from-scratch experiments with different loss-side interventions; all converge to the same DegEq fixed point (rdc* ≈ 10).
Achieves a best validation perplexity of 282.0 on WikiText-2.
Identifies Channel 2 (the LM head) as the dominant cause of DegEq: fixing Channel 2 alone reduces rdc*, the fixed-point value of the Radial Drift Coefficient, by 91%.
Finds that no loss-side intervention prevents DegEq, which points to an architectural source rather than the loss geometry.

Abstract

HyDRA v4: Hyperbolic Distillation with Riemannian Adaptation (Channel Attribution and Proof by Elimination)

HyDRA is a compact autoregressive language model (≈12M parameters) that operates natively on the Lorentz hyperboloid H¹²⁸ and is trained via knowledge distillation from a frozen GPT-2-small teacher (117M parameters). Every layer (attention, feed-forward, and residual connections) preserves the Riemannian manifold constraint to float64 precision. Training on WikiText-2 achieves a best validation perplexity of 282.0 (Variant F, 44,000 steps), with Minkowski constraint violations below 10⁻⁶ throughout.

MAIN FINDING: Channel 2 (LM Head) is the Dominant Cause of DegEq

We identify, characterise, and causally attribute Degenerate Equilibrium (DegEq): a stable fixed point of KL-based hyperbolic distillation in which angular alignment stabilises while radial dynamics remain active, yielding a geometrically valid but semantically degraded configuration.

Five from-scratch experiments spanning the complete space of loss-side interventions all converge to the same fixed point (rdc* ≈ 10, relative deviation <5%): standard KL (Variant F), Projective KL (D1), Decoupled Radial-Angular (D3), Origin-Tangent Euclidean Distillation (OTED), and OTED with radial anchor (V5-D). This constitutes a proof by exhaustion that no loss-side intervention prevents DegEq.

A complete 2×2 channel-isolation matrix (V5) then surgically attributes the attractor to its architectural source:

V5-D (no fix): rdc* = 10.74, DegEq baseline
V5-B (Ch2 fix only): rdc* = 0.96, 91% reduction ★
V5-A (Ch1 fix only): RadiusCollapse; Ch1 alone is unstable
V5-C (both channels): NaN explosion; numerical instability

V5-B activates only AngularLMHead (the Channel 2 fix), leaving the optimizer unchanged. The result is a 91% reduction in rdc*, establishing Channel 2 (the LM head radial gradient ∂logit/∂r = cosh(r) ≠ 0) as the dominant and sufficient cause of DegEq. Fixing Channel 2 alone is necessary and sufficient to eliminate the attractor.

MECHANISTIC EXPLANATION

Two independent radial channels drive DegEq:

Channel 1 (optimizer): First-order parallel transport of AdamW momentum accumulates a radially biased approximation error εₜ ∝ xₜ via the Christoffel symbol Γʳ_θθ = −sinh(r) cosh(r). Correcting this channel delays DegEq onset but cannot neutralise the attractor. When the Channel 1 fix is applied without the Channel 2 fix, it triggers RadiusCollapse, a pathology distinct from DegEq.

Channel 2 (LM head): The vocabulary projection computes ∂logitₖ/∂r = cosh(rₕ) ≠ 0, injecting a radial gradient that bypasses any loss-side or optimizer-side intervention. This is the dominant channel: zeroing it via AngularLMHead eliminates 91% of the attractor value.

Additionally, the Krioukov (2010) curvature–Zipf equilibrium predicts K* = 1/(4(γ−1)²) = 66.4 for WikiText-2 (γ ≈ 1.06), versus the model's fixed K = 1. This mismatch of 65.4 units may explain the specific attractor value rdc* ≈ 10 as a thermodynamic equilibrium between manifold curvature and corpus statistics.

STRUCTURAL CONTRIBUTIONS

(1) Radial Drift Coefficient (RDC). Real-time diagnostic proxy RDC = σ_logit / (L_hidden + ε), smoothed with an EMA (β = 0.95), with Lyapunov potential Lq = ½·rdc². Predicts DegEq onset 500–1,000 steps in advance.
(2) Riemannian Natural Gradient Correction. r/sinh(r) scaling of AdamW updates on manifold parameters (Amari, 1998). Delays DegEq onset from step ≈5,400 to beyond step 33,400 in the extended Variant F run.
(3) AngularLMHead. Cosine-similarity vocabulary head with ∂logit/∂r = 0 exactly. Eliminates Channel 2, the dominant DegEq source. V5-B result: rdc* 10.74 → 0.96 (91% reduction).
(4) EarlyStoppingV3. Dual-EMA stopping (fast β = 0.3, slow β = 0.9) with detrended noise estimation. Eliminates false positives on true loss plateaus, where single-reference EMA fires spuriously.
(5) Origin-Tangent Euclidean Distillation (OTED). All objectives are computed in TₒHⁿ ≅ ℝⁿ, eliminating Christoffel symbols from the backward pass entirely. Reaches rdc* = 10.67, confirming that loss geometry is not the causal channel.
(6) cgt.diagnostics. Post-training DegEq analysis module: Krioukov K* equilibrium (k_equilibrium_from_zipf), Khrulkov frequency–radius correlation (freq_radius_correlation), and the DegEqDiagnostics unified report. Purely additive: no training code is modified.

GEOMETRIC–LINGUISTIC DECOUPLING (Negative Result)

Geometric fidelity is neither sufficient for nor predictive of linguistic competence. Despite 9/10 geometry tests passing and Minkowski violations below 10⁻⁶ throughout 44,000 steps, generated text remains incoherent. Good geometry is cheap (enforcing Riemannian correctness under architectural constraints requires no special effort) but does not imply meaningful representation learning.

LIMITATIONS

Single-seed results (SEED=42). The DegEq characterisation is empirical and restricted to the tested architecture (4L×128d) and dataset (WikiText-2). V5-C (Ch1+Ch2) suffered a NaN explosion at step 13, a numerical instability in the Ch1+Ch2+OTED interaction that is distinct from DegEq and unresolved. The Krioukov K* prediction (a learnable K shifting rdc* to a data-dependent fixed point) remains falsifiable but untested.

Code: https://github.com/gokuhayda/MyShowCase/tree/main/hyperbolic-intelligence
License: CC BY-NC-SA 4.0
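The Radial Drift Coefficient in contribution (1) can be illustrated with a minimal sketch. It assumes that σ_logit is the standard deviation of a batch of logits, that L_hidden is a scalar hidden-state loss, and that the EMA is seeded with the first raw value. The abstract does not state any of these details, and the class name RDCMonitor is hypothetical rather than taken from the repository.

```python
import numpy as np

class RDCMonitor:
    """EMA-smoothed RDC = sigma_logit / (L_hidden + eps).

    Illustrative only: how sigma_logit and L_hidden are measured is an assumption.
    """

    def __init__(self, beta=0.95, eps=1e-8):
        self.beta = beta      # EMA decay, 0.95 as quoted in the abstract
        self.eps = eps        # guards against division by zero
        self.rdc = None       # smoothed coefficient, unset until first update

    def update(self, logits, hidden_loss):
        # Raw proxy: spread of the logits relative to the hidden-state loss.
        raw = float(np.std(logits)) / (hidden_loss + self.eps)
        if self.rdc is None:
            self.rdc = raw    # assumption: seed the EMA with the first value
        else:
            self.rdc = self.beta * self.rdc + (1.0 - self.beta) * raw
        return self.rdc

    def lyapunov(self):
        # Lyapunov potential Lq = 0.5 * rdc^2 from the abstract.
        return 0.5 * self.rdc ** 2
```

Calling update once per training step and logging lyapunov() alongside it would give the kind of running signal the abstract describes. The threshold at which DegEq onset is declared is not given in the abstract and is left out here.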
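The Channel 2 argument rests on the claim that a cosine-similarity head has no radial gradient while an inner-product head does. The sketch below checks this numerically on the unit hyperboloid. The two head functions are simplified stand-ins and assumptions, not HyDRA's actual LM head or AngularLMHead: the radius-sensitive head is taken to be a plain inner product with the spatial coordinates, and the angular head is the cosine between the spatial direction and the weight vector.

```python
import numpy as np

def lorentz_point(r, u):
    """Point at geodesic distance r from the origin of the unit hyperboloid, in direction u."""
    u = u / np.linalg.norm(u)
    return np.concatenate(([np.cosh(r)], np.sinh(r) * u))

def linear_logit(x, w):
    # Hypothetical radius-sensitive head: inner product with the spatial coordinates.
    return float(w @ x[1:])

def angular_logit(x, w):
    # Hypothetical angular head: cosine similarity, which ignores the radius.
    s = x[1:]
    return float((w @ s) / (np.linalg.norm(w) * np.linalg.norm(s)))

def radial_derivative(logit_fn, r, u, w, h=1e-5):
    # Central finite difference of the logit with respect to the radius r.
    hi = logit_fn(lorentz_point(r + h, u), w)
    lo = logit_fn(lorentz_point(r - h, u), w)
    return (hi - lo) / (2.0 * h)

u = np.array([0.6, 0.8, 0.0])   # unit direction
w = np.array([1.0, 2.0, 0.5])   # arbitrary weight row
r = 1.0
d_linear = radial_derivative(linear_logit, r, u, w)    # close to cosh(r) * (w . u)
d_angular = radial_derivative(angular_logit, r, u, w)  # close to zero
```

Under these assumptions the inner-product head's radial derivative scales with cosh(r), so it grows as embeddings move away from the origin, while the cosine head's derivative vanishes at every radius.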
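The equilibrium formula K* = 1/(4(γ−1)²) is steep near γ = 1, so small changes in the Zipf exponent move K* considerably. The sketch below evaluates the formula and its inverse. The function names are hypothetical and this is not the repository's diagnostics code.

```python
import math

def krioukov_k_star(gamma):
    """Equilibrium curvature K* = 1 / (4 (gamma - 1)^2), as quoted in the abstract."""
    if gamma == 1.0:
        raise ValueError("K* diverges at gamma = 1")
    return 1.0 / (4.0 * (gamma - 1.0) ** 2)

def gamma_from_k_star(k_star):
    """Inverse on the gamma > 1 branch: gamma = 1 + 1 / (2 sqrt(K*))."""
    return 1.0 + 1.0 / (2.0 * math.sqrt(k_star))

k_rounded = krioukov_k_star(1.06)      # about 69.4 with gamma rounded to two decimals
gamma_exact = gamma_from_k_star(66.4)  # about 1.0614, which rounds to the quoted 1.06
```

With γ rounded to 1.06 the formula gives roughly 69.4, whereas the quoted K* = 66.4 corresponds to γ ≈ 1.0614. The two figures in the abstract therefore agree only up to rounding of γ.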
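The r/sinh(r) factor in contribution (2) can be computed stably near the origin, where the direct ratio is 0/0. The sketch below assumes r is the geodesic distance of a manifold parameter from the origin. How the factor is combined with the AdamW update in HyDRA is not specified in the abstract, so the function only returns the scale itself, and its name is hypothetical.

```python
import numpy as np

def radial_scale(r, tiny=1e-6):
    """Return r / sinh(r), using the series 1 - r^2/6 for |r| < tiny."""
    r = np.asarray(r, dtype=np.float64)
    small = np.abs(r) < tiny
    safe_r = np.where(small, 1.0, r)   # avoid 0/0 in the branch that is discarded
    return np.where(small, 1.0 - r ** 2 / 6.0, safe_r / np.sinh(safe_r))

scales = radial_scale(np.array([0.0, 0.5, 1.0, 5.0]))
```

The factor equals 1 at the origin and decays towards 0 as r grows, so any update it multiplies is damped more strongly for parameters far from the origin.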
