What does this research mean for the field?

A rotational scheduling approach for sub-module memory residency enables the execution of large Mixture-of-Experts language models on consumer hardware with limited GPU memory while maintaining viable decode throughput. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This paper aims to explore running large Mixture-of-Experts models on consumer hardware with limited GPU memory.

May 29, 2026 · Open Access

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory

Key Points

This paper aims to explore running large Mixture-of-Experts models on consumer hardware with limited GPU memory.
Executed the Qwen3.6-35B-A3B MoE model on an RTX 4060 Laptop GPU with 8 GB of VRAM.
Implemented a rotating resource-management strategy for executing model components.
Measured output tokens, VRAM usage, and decode throughput during public validation.
Generated 2048 output tokens with approximately 6.3 GB of VRAM usage.
Achieved a decode throughput of 21.06 tokens per second with a 10/10 completion rate on a smoke-set evaluation.
Findings suggest feasibility for local execution of large models without dedicated data-center infrastructure.
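
The reported figures lend themselves to a quick back-of-envelope check. The sketch below is illustrative only. The token count, throughput, and VRAM numbers are taken from the key points above. The parameter counts (35B total, 3B active) are an assumption read from the model name, and the bit widths are hypothetical, since the source does not state the precision or quantization used.

```python
# Back-of-envelope arithmetic on the reported validation figures.
# Reported values come from the text; everything marked "assumption" does not.

OUTPUT_TOKENS = 2048        # reported output length
DECODE_TOK_PER_S = 21.06    # reported decode throughput
VRAM_TOTAL_GB = 8.0         # reported GPU memory
VRAM_USED_GB = 6.3          # reported (approximate) usage

TOTAL_PARAMS = 35e9         # assumption: "35B" in the model name
ACTIVE_PARAMS = 3e9         # assumption: "A3B" in the model name


def decode_seconds(tokens: int, tok_per_s: float) -> float:
    """Decode time if the reported rate held for the whole run."""
    return tokens / tok_per_s


def weight_gb(params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes at a given precision."""
    return params * bits_per_param / 8 / 1e9


run_seconds = decode_seconds(OUTPUT_TOKENS, DECODE_TOK_PER_S)
headroom_gb = VRAM_TOTAL_GB - VRAM_USED_GB

# Hypothetical precisions, chosen only to show the scale of the gap.
full_fp16_gb = weight_gb(TOTAL_PARAMS, 16)
full_4bit_gb = weight_gb(TOTAL_PARAMS, 4)
active_4bit_gb = weight_gb(ACTIVE_PARAMS, 4)

print(f"decode time for the run: {run_seconds:.1f} s")
print(f"VRAM headroom: {headroom_gb:.1f} GB")
print(f"all weights at 16 bits: {full_fp16_gb:.1f} GB")
print(f"all weights at 4 bits: {full_4bit_gb:.1f} GB")
print(f"active weights at 4 bits: {active_4bit_gb:.1f} GB")
```

Under these assumptions, the full weight set would exceed 8 GB even at 4 bits per parameter, while the weights active for a single token would not. That gap is what makes keeping only part of the model resident at any moment plausible in principle.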

Abstract

This technical paper presents Rotary GPU, an exploratory execution approach for running large Mixture-of-Experts language models locally on consumer hardware with limited GPU memory. A public validation was conducted using a Qwen3.6-35B-A3B-class MoE model executed on a consumer laptop with an RTX 4060 Laptop GPU containing only 8 GB of VRAM. Under the primary operating configuration, the system generated 2048 output tokens while maintaining approximately 6.3 GB of VRAM usage and an observed decode throughput of 21.06 tokens per second, alongside a 10/10 completion rate on a short smoke-set evaluation. The work derives from a previously disclosed rotary-based accelerator residency concept (Korean Patent Publication KR 10-2026-0070380). Rather than assuming that every model component must remain permanently resident in accelerator memory, the approach treats residency as a rotating resource-management problem in which sub-modules move between execution slots according to structured rotational scheduling. The paper documents externally observable validation results; internal implementation details remain undisclosed. The objective is not to replace data-center infrastructure but to explore whether some capabilities of large models can be brought closer to settings where such infrastructure is unavailable, such as closed-network, on-premise, or resource-constrained organizations. Results are exploratory rather than definitive, and the validation package requires users to supply their own compatible model files. Part of the ANIMA Research paper series by independent researcher Myeong Jun Jo (ORCID: 0009-0006-9540-4666).
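
To make the idea of residency as a rotating resource concrete, here is a toy sketch of a fixed ring of execution slots. It is not the paper's method: the abstract states that internal implementation details are undisclosed, so the class `RotatingResidency`, its `acquire` method, the round-robin replacement rule, and the module names are all hypothetical. It shows only the general pattern of a small number of slots being reused in rotation while most modules stay outside accelerator memory.

```python
# Toy model of slot-based residency with round-robin replacement.
# Hypothetical illustration; not the undisclosed Rotary GPU implementation.


class RotatingResidency:
    """A fixed ring of slots; a miss overwrites the slot under the hand."""

    def __init__(self, num_slots: int) -> None:
        if num_slots < 1:
            raise ValueError("num_slots must be at least 1")
        self.slots = [None] * num_slots  # module id held by each slot
        self.hand = 0                    # next slot to overwrite on a miss
        self.loads = 0                   # number of simulated transfers

    def acquire(self, module_id: str) -> int:
        """Return the slot index holding module_id, loading it on a miss."""
        if module_id in self.slots:
            return self.slots.index(module_id)  # already resident
        slot = self.hand
        self.slots[slot] = module_id  # stands in for a host-to-device copy
        self.hand = (self.hand + 1) % len(self.slots)
        self.loads += 1
        return slot


# Usage: two slots serving a short, made-up sequence of module requests.
ring = RotatingResidency(num_slots=2)
trace = ["expert_a", "expert_b", "expert_a", "expert_c", "expert_a"]
placements = [ring.acquire(module) for module in trace]
print(placements, ring.loads, ring.slots)
```

In this toy trace the final request for `expert_a` misses because `expert_c` overwrote its slot one step earlier. A real scheduler would need a replacement rule that reflects which modules the model is likely to need next, which is where the design effort in any such system would lie.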
