A class-discriminant codebook significantly improved macro-AUC over an unsupervised codebook for single-token ECG signal compression (MD 0.045; 95% CI 0.018-0.059; p≈0.000).
Does a class-discriminant codebook improve diagnostic accuracy (macro-AUC) compared to an unsupervised codebook in single-token ECG signal compression?
A supervised class-discriminant codebook for single-token ECG signal compression significantly improves diagnostic accuracy compared to conventional unsupervised methods by reallocating bits from reconstruction fidelity to task relevance.
Mean Difference: 0.045 (95% CI 0.018–0.059)
Macro-AUC: 0.7605 (class-discriminant codebook) vs 0.7156 (unsupervised codebook)
p-value: p≈0.000
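The interval and p-value above are reported as paired-bootstrap results. The deposit does not list its resampling code, so the following is only a minimal sketch of one common percentile variant, written in Python with NumPy. The names `binary_auc`, `macro_auc`, and `paired_bootstrap`, the resample count, and the synthetic scores are illustrative assumptions, not the study's pipeline.

```python
import numpy as np

def binary_auc(is_pos, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Memory is O(n_pos * n_neg), which is fine
    for a sketch but not tuned for large cohorts.
    """
    pos = scores[is_pos]
    neg = scores[~is_pos]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

def macro_auc(y, scores):
    """Unweighted mean of one-vs-rest AUCs over the columns of `scores`."""
    n_classes = scores.shape[1]
    return float(np.mean([binary_auc(y == c, scores[:, c])
                          for c in range(n_classes)]))

def paired_bootstrap(y, scores_a, scores_b, n_boot=2000, seed=0):
    """Percentile CI and two-sided p-value for macro-AUC(a) - macro-AUC(b).

    The same resampled record indices are used for both systems, which is
    what makes the test paired. Assumes every class appears in every
    resample (reasonable for a large, balanced cohort).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    observed = macro_auc(y, scores_a) - macro_auc(y, scores_b)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample records with replacement
        deltas[i] = (macro_auc(y[idx], scores_a[idx])
                     - macro_auc(y[idx], scores_b[idx]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    p = 2 * min(np.mean(deltas <= 0), np.mean(deltas >= 0))
    return observed, (float(lo), float(hi)), float(min(1.0, p))

# Synthetic stand-in for two classifiers scored on the same test records.
rng = np.random.default_rng(42)
n, C = 300, 3
y = rng.integers(0, C, size=n)
onehot = np.eye(C)[y]
scores_a = 1.5 * onehot + rng.normal(size=(n, C))  # stronger class signal
scores_b = 0.5 * onehot + rng.normal(size=(n, C))  # weaker class signal

delta, (lo, hi), p = paired_bootstrap(y, scores_a, scores_b, n_boot=500)
print(f"delta macro-AUC = {delta:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], p = {p:.3f}")
```

In a scheme like this one, a p-value printed as ≈0.000 means that no resample produced a difference on the other side of zero, so the estimate is bounded by roughly 1/n_boot rather than being exactly zero.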
We report a pre-registered study of single-token signal compression, run on Modal cloud A100 GPUs. Mapping an entire signal record to one ≈10-bit codebook index (~1000× reduction) consumed by a decoupled downstream model incurs a compression tax: accuracy lost relative to an uncompressed same-feature classifier. We show this tax is, in substantial part, an artifact of the quantizer's objective rather than its bit budget. A conventional unsupervised codebook (k-means) spends the token's bits minimizing feature reconstruction error; we instead build the codebook within a supervised class-discriminant subspace (linear discriminant analysis, ≤C−1 axes), optionally augmented with a bounded number r of residual principal-component axes, fit on labeled training data only. On a balanced real 12-lead clinical ECG cohort (PTB-XL, five diagnostic superclasses, n≈6,380), the discriminant codebook recovers a statistically significant +0.045 macro-AUC over a same-size unsupervised codebook (0.7156→0.7605; paired-bootstrap p≈0.000, 95% CI +0.018 to +0.059). The encoder is also faster than the unsupervised baseline because assignment occurs in a reduced subspace; the compression ratio, token interface, downstream model, and energy profile are unchanged. We map the full (K,r) construction surface and confirm an interior optimum at K≈512, r≈8 by a five-seed paired test (best single-token macro-AUC 0.7734, ≈48% of the tax recovered; p≈0.029 versus K=1024). Across token bit budgets from 1 to 11 bits the discriminant codebook Pareto-dominates the unsupervised codebook at every budget, tracing a rate–relevance frontier. The advantage is downstream-model-agnostic: it holds for a naive-Bayes lookup table, logistic regression, an MLP, and a random forest, not only the language-model head (ΔAUC +0.046 to +0.049).
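The construction described above (a class-discriminant subspace of at most C−1 LDA axes, up to r residual principal-component axes, then a K-entry codebook fit in that reduced space) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the regularisation term `reg`, the plain Lloyd k-means, the unscaled concatenation of the two sets of axes, every function name, and the synthetic data are hypothetical.

```python
import numpy as np

def lda_axes(X, y, reg=1e-3):
    """Top C-1 Fisher discriminant directions (assumed ridge term `reg`)."""
    classes = np.unique(y)
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw = Sw / len(X) + reg * np.eye(d)
    Sb = Sb / len(X)
    # Whiten with Sw^(-1/2), then eigendecompose the between-class scatter.
    evals, evecs = np.linalg.eigh(Sw)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    _, vecs = np.linalg.eigh(W @ Sb @ W)
    top = vecs[:, ::-1][:, : len(classes) - 1]  # largest eigenvalues first
    return W @ top  # shape (d, C-1)

def residual_pca_axes(X, disc, r):
    """Top r principal axes of the data after removing the LDA subspace."""
    if r == 0:
        return np.zeros((X.shape[1], 0))
    Q, _ = np.linalg.qr(disc)  # orthonormal basis of the LDA span
    Xc = X - X.mean(axis=0)
    resid = Xc - Xc @ Q @ Q.T
    _, _, Vt = np.linalg.svd(resid, full_matrices=False)
    return Vt[:r].T  # shape (d, r)

def assign(Z, centroids):
    """Index of the nearest centroid for each row of Z."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(Z, K, n_iter=25, seed=0):
    """Plain Lloyd iterations, initialised from K random training points."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(Z, centroids)
        for k in range(K):
            members = Z[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids

def fit_discriminant_codebook(X, y, K, r, seed=0):
    """Fit projection and codebook on labeled training data only."""
    disc = lda_axes(X, y)
    P = np.hstack([disc, residual_pca_axes(X, disc, r)])
    mu = X.mean(axis=0)
    centroids = kmeans((X - mu) @ P, K, seed=seed)
    return mu, P, centroids

def encode(X, mu, P, centroids):
    """Map each record to a single codebook index (the token)."""
    return assign((X - mu) @ P, centroids)

# Synthetic stand-in: class signal in a few low-variance features,
# with higher-variance nuisance features that carry no label information.
rng = np.random.default_rng(0)
n, d, C, K, r = 600, 20, 3, 16, 2
y = rng.integers(0, C, size=n)
X = rng.normal(size=(n, d)) * np.linspace(0.5, 3.0, d)
X[:, :C] += 2.0 * np.eye(C)[y]

mu, P, centroids = fit_discriminant_codebook(X, y, K=K, r=r)
tokens = encode(X, mu, P, centroids)
bits = int(np.ceil(np.log2(K)))
print(f"projection {P.shape}, token range [0, {K}), {bits} bits per record")
```

With K = 16 the token costs 4 bits; under the same ceil(log2 K) count, K = 512 and K = 1024 correspond to 9 and 10 bits. Nearest-centroid assignment here runs in C−1+r dimensions rather than the full feature dimension, which is consistent with the abstract's remark that the encoder is faster than the unsupervised baseline.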
The effect is tax-dependent: its direction replicates on a second ECG dataset (MIT-BIH, ΔAUC +0.010) and a non-ECG kinematic modality (smartphone-inertial activity recognition, ΔAUC +0.008, p≈0.017), but its magnitude scales with the size of the single-token tax and is therefore largest on hard, information-rich tasks. We characterize the construction's privacy posture honestly: it reduces within-session but increases cross-session patient re-identification (the residual axes carry subject-stable morphology), a reversal we disclose and mitigate with a residual-free configuration. We also note a free codebook-residual novelty/drift monitor. The result is a drop-in, architecturally-free reallocation of a single token's bits from reconstruction fidelity to task relevance. Every claim is anchored to a pre-registered measurement and every boundary, including the negatives, is disclosed.

Keywords / index terms: single-token compression; vector quantization; class-discriminant codebook; linear discriminant analysis; rate–relevance frontier; information bottleneck; electrocardiogram; cross-modality generalization; patient re-identification; pre-registration.

References:
1. Y. Linde, A. Buzo, and R. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, 1980.
2. T. Kohonen, "Learning vector quantization," in Self-Organizing Maps, Springer, 1995.
3. P. Schneider, M. Biehl, and B. Hammer, "Adaptive relevance matrices in learning vector quantization," Neural Computation, 2009.
4. Z. Jiang, Z. Lin, and L. Davis, "Label consistent K-SVD: learning a discriminative dictionary for recognition," IEEE TPAMI, 2013.
5. N. Tishby, F. Pereira, and W. Bialek, "The information bottleneck method," 1999.
6. R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, 1936.
7. N. K. Ratha, J. H. Connell, and R. M. Bolle, "Enhancing security and privacy in biometrics-based authentication systems," IBM Systems Journal, 2001.
8. P. Wagner et al., "PTB-XL, a large publicly available electrocardiography dataset," Scientific Data, 2020.
9. G. Moody and R. Mark, "The impact of the MIT-BIH arrhythmia database," IEEE EMB Magazine, 2001.
10. D. Anguita et al., "A public domain dataset for human activity recognition using smartphones," ESANN, 2013.
11. E. Hu et al., "LoRA: low-rank adaptation of large language models," ICLR, 2022.
12. Qwen Team, "Qwen2.5 Technical Report," Alibaba Group, 2024.
13. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall/CRC, 1993.
14. R. J. Ferlic and K. K. Ferlic, companion deposits (Papers 13–18), Zenodo, 2026.

Companion deposits in this Zenodo Community (spiral-domain-encoder-campaign):
· Paper 13: 10.5281/zenodo.20595834
· Paper 14 v3: 10.5281/zenodo.20596597
· Paper 15: 10.5281/zenodo.20596931
· Paper 16: 10.5281/zenodo.20602959
· Paper 17: 10.5281/zenodo.20634763
· Paper 18: 10.5281/zenodo.20709337
· Paper 19: DOI reserved at deposit (this deposit)

Related U.S. Provisional Patent Application: the class-discriminant single-token codebook construction disclosed here is covered by Parent N, U.S. Provisional Application No. 64/095,354, filed 2026-06-21, building on the spiral-domain H-pipeline applications (Parents H/I/J/K/L/M).

Licensing inquiries: Randolph James Ferlic, M.D., randolphf@fieldstoneanalyticsllc.com.

Reproducibility archive released under CC-BY 4.0.
Ferlic et al. compared a class-discriminant codebook with an unsupervised (k-means) codebook for single-token ECG signal compression in an ECG classification study (n≈6,380). The class-discriminant codebook significantly improved macro-AUC (MD 0.045; 95% CI 0.018–0.059; p≈0.000).