June 17, 2026 | Open Access

SW-SpeedDLM: Sliding Window Speculative Decoding for Diffusion Language Models Under Long Context Constraints

Key Points

Aimed to make long-context generation with masked diffusion language models (MDLMs) more efficient without exceeding GPU memory limits.
Introduced SW-SpeedDLM, a three-component inference wrapper for pretrained MDLMs.
Used Segmented Sliding Window Denoising (SSWD) to limit each denoising loop to a window of W tokens.
Added Cross-Segment KV Compression (CSKV), which encodes each completed window into C summary tokens, and Window Level Speculative Acceptance (WLSA), which yields up to 2.4× per-window speedup.
Achieved 3.7× higher throughput at n=8192 compared to full attention at n=2048.
Reduced peak memory usage by 2.5× compared to full attention at n=4096.
Increased PG-19 bits per character by only 0.18 as the quality cost of these gains.
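
To make the windowing idea in the points above concrete, here is a minimal counting sketch in Python. It is not the authors' implementation. It assumes consecutive, non-overlapping windows and assumes that each window attends to its own W tokens plus C summary tokens for every earlier window; the names plan_windows, attention_pairs_per_pass, and full_attention_pairs are hypothetical, and any values passed for W and C are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class WindowPlan:
    start: int            # first token position in the window (inclusive)
    end: int              # one past the last token position (exclusive)
    summary_tokens: int   # summary tokens visible from earlier windows


def plan_windows(n: int, window: int, summaries_per_window: int) -> list[WindowPlan]:
    """Split n positions into consecutive windows.

    Assumption: windows do not overlap, and every completed window
    contributes a fixed number of summary tokens to all later windows.
    """
    if n <= 0 or window <= 0 or summaries_per_window < 0:
        raise ValueError("n and window must be positive; summaries must be >= 0")
    plans = []
    for index, start in enumerate(range(0, n, window)):
        end = min(start + window, n)  # the last window may be shorter
        plans.append(WindowPlan(start, end, index * summaries_per_window))
    return plans


def attention_pairs_per_pass(plans: list[WindowPlan]) -> int:
    """Count query-key pairs when every window is denoised for one step.

    Each query in a window attends to the window's own tokens and to the
    summary tokens of the windows that came before it.
    """
    total = 0
    for plan in plans:
        width = plan.end - plan.start
        total += width * (width + plan.summary_tokens)
    return total


def full_attention_pairs(n: int) -> int:
    """Count query-key pairs for one step of full bidirectional attention."""
    return n * n
```

Comparing attention_pairs_per_pass(plan_windows(n, W, C)) with full_attention_pairs(n) for a chosen n, W, and C shows how restricting attention to a window plus a few summaries shrinks the per-step work. The counts cover attention pairs only and say nothing about the cost of producing the summaries.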

Abstract

Masked diffusion language models (MDLMs) apply full bidirectional attention at every denoising step, which incurs O(Tn²) cost in the number of steps T and the sequence length n. For an 8B-parameter model at n = 8192 with T = 128, the KV cache alone exceeds 40 GB, which rules out long-document generation on a single GPU. We introduce SW-SpeedDLM, an inference wrapper for pretrained MDLMs that generates sequences of up to 16,384 tokens on one A100 with 40 GB. The framework comprises three components, each targeting a distinct bottleneck. Segmented Sliding Window Denoising (SSWD) restricts each denoising loop to a window of W tokens and reduces the per-step cost to O(W²n/S). Cross-Segment KV Compression (CSKV) encodes each completed window into C summary tokens that later windows attend to at O(C) cost. Window Level Speculative Acceptance (WLSA) lets a small draft model propose k denoising steps that the target model verifies in a single forward pass, yielding up to 2.4× per-window speedup. We prove that WLSA preserves the exact marginal distribution of the target model. On MDLM-860M and LLADA-8B across PG-19, LongBench, and WritingPrompts, SW-SpeedDLM achieves 3.7× higher throughput at n = 8192 than full attention generation at n = 2048 (its maximum feasible length) and lowers peak memory by 2.5× relative to full attention at n = 4096, the longest length full attention can support before exhausting GPU memory. These gains come at the cost of a PG-19 bits-per-character increase of only 0.18.
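
The abstract does not spell out the acceptance rule behind WLSA, so the following Python sketch shows only the standard single-token speculative sampling rule, which is known to preserve the target distribution. It is an illustration of that general idea, not the paper's window-level procedure. The function names and the list-of-probabilities representation are assumptions made for this example.

```python
import random


def residual_distribution(target: list[float], draft: list[float]) -> list[float]:
    """Return normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(raw)
    if total == 0.0:
        # target == draft: a rejection cannot occur, so any fallback is safe.
        return list(target)
    return [r / total for r in raw]


def speculative_step(target: list[float], draft: list[float],
                     rng: random.Random) -> tuple[int, bool]:
    """Propose from the draft, accept with probability min(1, p/q),
    otherwise resample from the residual distribution.

    Returns the chosen index and whether the draft proposal was accepted.
    """
    proposal = rng.choices(range(len(draft)), weights=draft, k=1)[0]
    accept_prob = min(1.0, target[proposal] / draft[proposal])
    if rng.random() < accept_prob:
        return proposal, True
    residual = residual_distribution(target, draft)
    return rng.choices(range(len(residual)), weights=residual, k=1)[0], False


def output_distribution(target: list[float], draft: list[float]) -> list[float]:
    """Compute analytically the distribution that speculative_step samples from.

    Accepted mass for outcome x is min(p(x), q(x)); the remaining mass is
    spread according to the residual. The sum equals p(x) for every x.
    """
    accepted = [min(p, q) for p, q in zip(target, draft)]
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(target, draft)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]
```

Calling output_distribution with any pair of valid probability vectors returns the target vector up to floating-point error, which is the single-token analogue of the exactness property claimed for WLSA. How the paper extends this guarantee to k denoising steps over a window is not covered by this sketch.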
