This research addresses the CPU-bound dispatch bottlenecks that limit LLaDA-MoE performance during the generation phase. The core solution is a custom 2D fused Triton kernel that shifts parallelization from standard token-centric partitioning to the expert's internal intermediate dimension. This keeps GPU utilization high even in "skinny matrix" scenarios, where each expert receives only 10 to 50 tokens and token-centric partitioning leaves most of the hardware idle. By fusing the entire pipeline and moving data-flow management onto the GPU, the kernel replaces slow CPU-side sorting with fast hardware-level accumulation. The benchmarks show a shift from host-bound to execution-bound processing: throughput improves 6x and exceeds 300 tokens per second, fixed dispatch costs fall 20-fold, and CPU-side overhead is roughly halved compared to native implementations. Despite these gains, the VRAM footprint remains identical to the baseline, holding steady at approximately 14 GiB with less than 1% fluctuation. These results indicate that hardware-aware kernel optimization can virtually eliminate architectural overhead without sacrificing memory stability or generation quality on LLaDA-MoE.
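The idea of partitioning along the expert's intermediate dimension can be illustrated with a small CPU-only sketch. This is not the author's Triton kernel: it assumes a simplified two-matrix expert with a ReLU activation, and the names `expert_reference`, `expert_split_intermediate`, `w_up`, and `w_down` are hypothetical. It only shows that, because the activation is elementwise, each slice of the intermediate dimension yields an independent partial output, so the slices can be processed in parallel and summed (on a GPU this sum would plausibly be an atomic add, which is an assumption here).

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def matmul(a, b):
    """Plain-Python product of a [m x k] and b [k x n]."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def expert_reference(x, w_up, w_down):
    """Dense expert: relu(x @ w_up) @ w_down, computed in one piece."""
    hidden = [[relu(v) for v in row] for row in matmul(x, w_up)]
    return matmul(hidden, w_down)


def expert_split_intermediate(x, w_up, w_down, block):
    """Same expert, but the work is cut along the intermediate dimension.

    Each loop iteration stands in for one independent GPU program: it owns
    `block` intermediate columns and adds its partial result into `out`.
    """
    tokens, d_model, inter = len(x), len(w_down[0]), len(w_down)
    out = [[0.0] * d_model for _ in range(tokens)]
    for start in range(0, inter, block):
        cols = range(start, min(start + block, inter))
        w_up_blk = [[row[c] for c in cols] for row in w_up]   # [d_model x block]
        w_down_blk = [w_down[c] for c in cols]                # [block x d_model]
        hidden_blk = [[relu(v) for v in row] for row in matmul(x, w_up_blk)]
        partial = matmul(hidden_blk, w_down_blk)
        for t in range(tokens):
            for j in range(d_model):
                out[t][j] += partial[t][j]  # stand-in for an atomic accumulate
    return out


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = make_matrix(4, 8, rng)        # only 4 tokens: a "skinny" input
w_up = make_matrix(8, 32, rng)    # d_model=8, intermediate=32 (toy sizes)
w_down = make_matrix(32, 8, rng)

ref = expert_reference(x, w_up, w_down)
split = expert_split_intermediate(x, w_up, w_down, block=8)
max_err = max(abs(a - b) for ra, rb in zip(ref, split) for a, b in zip(ra, rb))
```

With only 4 tokens, a token-centric split offers at most 4 units of parallel work, while the intermediate dimension offers `32 / block` slices regardless of how few tokens arrive. The two results differ only by floating-point rounding from the changed summation order.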
Aleksei Manakonov
Aleksei Manakonov (Thu,) studied this question.
www.synapsesocial.com/papers/69be38da6e48c4981c6797f3 — DOI: https://doi.org/10.5281/zenodo.19116469