What question did this study set out to answer?

The study asks how resilient a neural network's identity is to model theft by distillation, and how well forensic detection methods hold up against an adversary trying to erase or redirect the trace.

March 2, 2026 (Open Access)

The Geometry of Model Theft: Distillation Forensics, Adversarial Erasure, and the Illusion of Spoofing

Key points

Used a framework of log-prob order-statistic geometry for analysis
Conducted experiments across 72 checkpoints examining both structural and functional identities
Investigated the effects of adversarial erasure and passive fine-tuning on forensic detection
Functional identity transferred to student models, converging up to 52% toward the teacher’s template
Adversarial methods only transiently suppressed forensic traces; passive fine-tuning erased them more effectively, but at a measurable cost to general capability
Cross-family spoofing reached 69.4% convergence toward a decoy’s fingerprint, which the authors attribute to geometric alignment with the distillation trajectory rather than to a genuine vulnerability

Abstract

Recent disclosures of industrial-scale knowledge distillation, including campaigns comprising millions of fraudulent API exchanges targeting frontier models (Anthropic, 2026), have made post-hoc detection of model theft a critical security requirement. Building on a formally verified framework of log-prob order-statistic geometry, we investigate the adversarial resilience of neural network identity across 72 experimental checkpoints. We establish a Two-Layer Identity Hypothesis: a model’s structural identity (weights-regime geometry) is empirically invariant to distillation (within the acceptance threshold ε across all 18 protocols), while its functional identity (API-regime Poisson Point Process, or PPP, residuals) predictably transfers to the student, converging up to 52% toward the teacher’s template. Stress-testing this forensic channel against a white-box adversary, we find that functional provenance is geometrically coupled to the knowledge transfer objective. Adversarial erasure gradients are consistently dominated by the distillation loss, achieving only a transient suppression that rebounds within one epoch. Passive fine-tuning on fresh data erases the trace more effectively than any adversarial method, but at a measurable cost to general capability, revealing a Pareto frontier with no favorable region for the adversary. This establishes API forensics as a time-sensitive detective control (“The Tripwire”) and weights-regime identity as the immutable anchor (“The Vault”). Finally, we observe an apparent vulnerability: a cross-family adversarial spoofing attack achieves 69.4% convergence toward a decoy’s fingerprint, while same-family spoofing catastrophically fails. We resolve this paradox by mapping the PPP-residual vector space, revealing that models cluster by capability topology, not corporate lineage. Cross-family “spoofing” is a spatial illusion caused by a narrow 7.8-degree alignment between the decoy and the primary distillation trajectory (R² = 0.995), whereas same-family decoys are anti-aligned. Across all adversarial interventions, the underlying Gumbel universality (δ_norm) remains invariant (CV = 1.9%). We conclude that during active distillation, an adversary cannot simultaneously acquire a teacher’s capabilities and erase or redirect the forensic trace. In this setting, the geometry forbids it.
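
The abstract reports three kinds of geometric quantities (a convergence percentage, an alignment angle, and a coefficient of variation) without giving their formulas. The sketch below shows one plausible reading of each, purely for orientation. The function names and all three definitions are assumptions for illustration, not the paper's method: convergence is taken as the fraction of the initial Euclidean distance to the target that has been closed, alignment as the angle between two direction vectors, and CV as population standard deviation over mean.

```python
import math
from statistics import mean, pstdev


def convergence_fraction(start, end, target):
    """Fraction of the initial distance to `target` closed by moving
    from `start` to `end` (assumed definition, not from the paper)."""
    initial = math.dist(start, target)
    remaining = math.dist(end, target)
    return 1.0 - remaining / initial


def angle_degrees(u, v):
    """Angle between two direction vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.hypot(*u) * math.hypot(*v)
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp rounding error
    return math.degrees(math.acos(cosine))


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Toy 2-D stand-ins for fingerprint vectors (hypothetical numbers).
student_before = (0.0, 0.0)
student_after = (0.52, 0.0)
teacher = (1.0, 0.0)

trajectory = (1.0, 0.0)  # direction of the student's drift
aligned_decoy = (math.cos(math.radians(7.8)), math.sin(math.radians(7.8)))
anti_aligned_decoy = (-1.0, 0.1)

print(convergence_fraction(student_before, student_after, teacher))
print(angle_degrees(trajectory, aligned_decoy))
print(angle_degrees(trajectory, anti_aligned_decoy))
```

Under this reading, a decoy that lies within a few degrees of the distillation trajectory would register apparent "convergence" simply because the student is already moving in that direction, while an anti-aligned decoy (angle above 90 degrees) would see the student move away from it.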
