Speculative decoding accelerates large language model inference by using a small model to draft tokens that a large model verifies. We invert this paradigm: rather than using a small model to approximate a large one, we use a Diffusion Language Model (DLM) to structurally elevate a small one. We exploit the permanence property of absorbing-state masked diffusion—tokens committed during denoising are irrevocable—to extract anchor skeletons from as few as 10% of denoising steps. A 0.5B-parameter autoregressive model fills gaps between these anchors via forced decoding, achieving 0.82–0.93 F1 against the DLM’s full output (N=190 across four benchmarks). Ablation experiments (N=50) demonstrate that token identity, not position, drives anchor effectiveness: correct tokens at random positions yield 0.90 F1, while random tokens at correct positions yield 0.002. Gap-only decomposition shows the gap-filler more than doubles its unconstrained performance at non-anchor positions (0.475 vs. 0.231 word F1), with no correlation between anchor density and gap quality (ρ = −0.037, p = 0.876), confirming genuine conditional modeling rather than density inflation. We further show that a 0.5B gap-filler matches a 1.5B gap-filler when anchors are provided (∆ = −0.009), suggesting the DLM provides sufficient semantic structure. These findings establish an inverse speculation framework for cloud/edge deployment where a DLM transmits a compressed anchor-template payload—43% smaller than gzip-compressed full text—to enable high-fidelity reconstruction of DLM output on sub-billion-parameter edge devices. Component profiling on A100 hardware confirms a 2.3× sequential pipeline speedup, with cloud compute reduced by 89.7%.
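As a rough illustration of the gap-filling step described above, the sketch below forces anchor tokens into place and lets an autoregressive proposer fill only the uncommitted positions. Everything here is an assumption made for illustration: the names `fill_gaps`, `propose_next`, and `toy_model`, the use of `None` to mark a gap, and the word-level template are hypothetical and are not taken from the paper's implementation. A real gap-filler would be the 0.5B autoregressive model operating on subword tokens.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical convention: None marks a position the DLM has not committed.
GAP = None


def fill_gaps(
    template: Sequence[Optional[str]],
    propose_next: Callable[[List[str]], str],
) -> List[str]:
    """Decode left to right, forcing anchors and generating only at gaps."""
    output: List[str] = []
    for slot in template:
        if slot is not GAP:
            # Anchor position: the committed token is forced into the output,
            # regardless of what the small model would have preferred.
            output.append(slot)
        else:
            # Gap position: the small model conditions on the prefix so far,
            # which already contains every anchor to its left.
            output.append(propose_next(output))
    return output


def toy_model(prefix: List[str]) -> str:
    """Stand-in for a small autoregressive model (a fixed bigram lookup)."""
    bigram = {"the": "cat", "cat": "sat", "on": "the"}
    if not prefix:
        return "the"
    return bigram.get(prefix[-1], "<unk>")


# Anchor skeleton with three of six positions committed.
template = [GAP, "cat", GAP, "on", GAP, "mat"]
print(fill_gaps(template, toy_model))
```

In this simplified form the proposer sees only the left context. Whether the actual gap-filler also conditions on anchors to the right of a gap is not specified in the abstract, so this should be read as one possible reading of "forced decoding" rather than the authors' procedure.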
Benjamin Wade (Sat,) studied this question.