We propose a shift in how matrix multiplication is executed: instead of treating entries as opaque scalars, we directly access their existing bit-level structure in memory and operate on individual bitplanes. Since all data is already encoded in binary representation, we exploit this existing structure by bypassing scalar interpretation and working directly with the constituent bits. In this view, a product is computed as a sum of Boolean matrix products of bitplanes, reweighted by powers of two, with zero/one/negation multiplications vanishing by construction. Our overlay methods including mode Peeling, Value-Aware Collapse, and post-multiply Filtering/Sensing identify and eliminate redundant register calculations, filtering out only the partial products needed for the actual matrix multiplication result. We develop exact formulas showing per-entry cost requirements in terms of popcount operations for two-sided scenarios, or bit-width dependent costs when one operand has limited precision such as INT8 multiplied by binary values, and prove fixed-point exactness subject to accumulator width constraints. This reorganizes compute from scalar multiplications into bitwise AND, XNOR, POPCNT, and register shifts for powers of two, creating a truly multiplication-free computational path that relies entirely on single-cycle register operations and is cache-friendly, SIMD-friendly, and systolic-friendly. We present theoretical and empirical savings across multiple number systems including binary fields, ternary values, integer modular arithmetic, and fixed-point representations, with normalized performance plots showing measured data points for binary, ternary, and INT8-by-binary scenarios. We demonstrate how overlays reduce computational constants beyond what Strassen and TA48 algorithms achieve. Co-designed with GPUs and TPUs, bitplane-first overlays route semantic-pipeline execution to Boolean kernels while leaving dense computational tiles to tensor cores and systolic arrays, yielding end-to-end speedups. We estimate significant impact for large-scale AI training workloads and demonstrate substantial practical gains for inference and structured computational workloads, bringing the effective computational cost closer to quadratic scaling in regimes where bit-width limitations and semantic filtering bound the per-entry computational work.
Michael Rey (Fri,) studied this question.