What question did this study set out to answer?

This work aims to develop a lightweight language model, MinimalLLM, suitable for resource-constrained environments while maintaining effectiveness and interpretability.

June 23, 2026 | Open Access

A lightweight transformer based language model inspired by the Qwen architecture for resource constrained environments

Key Points

This work aims to develop a lightweight language model, MinimalLLM, suitable for resource-constrained environments while maintaining effectiveness and interpretability.
Utilized a transformer architecture inspired by the Qwen model family.
Implemented Grouped-Query Attention, RoPE, SwiGLU, and RMSNorm.
Pre-trained on a 500,000-token subset of the Cosmopedia-v2 corpus using mixed precision and gradient accumulation.
Achieved a validation loss of 0.0945 and perplexity of 1.10 with only 32.15 million parameters.
Demonstrated that modern transformer design principles can be implemented effectively in small-scale settings.
Performed additional analyses showing robustness through comparisons, statistical evaluations, and ablation studies.
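
The reported validation loss and perplexity can be cross-checked against each other. The sketch below assumes the loss is the mean per-token cross-entropy in nats, which is the usual convention but is not stated in this summary; under that assumption, perplexity is the exponential of the loss. The function name is illustrative and does not come from the paper.

```python
import math


def perplexity_from_loss(mean_cross_entropy_nats: float) -> float:
    """Return exp(loss); valid only if the loss is mean cross-entropy in nats (assumed)."""
    return math.exp(mean_cross_entropy_nats)


# Values quoted in the key points above.
reported_loss = 0.0945
reported_perplexity = 1.10

computed = perplexity_from_loss(reported_loss)
print(f"{computed:.2f}")  # prints 1.10
```

Under this assumption the two reported figures agree to two decimal places.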

Abstract

Large Language Models have proved extremely effective in NLP; however, they are computationally expensive and architecturally complex, which limits their use in systematic analysis and research where computational resources are scarce. In this work, we present MinimalLLM, a small-scale decoder-only language model built on a transformer architecture similar to that of the Qwen model family. The model aims to balance modern architectural quality with efficiency and transparency, with an emphasis on experimental control and interpretability. Our implementation features Grouped-Query Attention (GQA), RoPE, SwiGLU, and RMSNorm. We propose a novel hybrid optimisation approach that applies the Muon method to structured 2D weight matrices and AdamW to all other parameters, to facilitate stable convergence under a constrained computational budget. MinimalLLM is pre-trained on a streamed 500,000-token subset of the Cosmopedia-v2 synthetic corpus, using mixed-precision training and gradient accumulation on an NVIDIA Tesla T4 GPU. Although it has just 32.15 million parameters, MinimalLLM obtains a validation loss of 0.0945 and a perplexity of 1.10 on the Cosmopedia-v2 domain. These figures should be taken as indicative of the convergence the network achieved under its specific training conditions, given its over-parametrisation, the Zipfian sparsity of the corpus, and the constrained synthetic nature of the domain. Analyses of attention patterns, gradients, and confidence calibration indicate that optimisation remained stable and that the token-level representations are analytically meaningful. We show that state-of-the-art Transformer design principles can be implemented effectively in a small-scale setting and describe a reproducible approach to language modeling for research and educational purposes. Additional analyses, including comparisons with baselines, statistical performance evaluation, tests on other datasets, and ablation studies, reinforce the robustness of our approach.
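
The hybrid optimisation scheme described in the abstract amounts to partitioning parameters by shape. The following is a minimal sketch of that routing step only, not the authors' implementation: it assumes the rule is purely "two-dimensional tensors go to Muon, everything else goes to AdamW". The parameter names and shapes are hypothetical, and the abstract does not say how 2D tensors such as the embedding table are treated, so that case is left out.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def split_parameters(shapes: Dict[str, Shape]) -> Tuple[List[str], List[str]]:
    """Route 2D weight matrices to a Muon group and all other tensors to an AdamW group.

    Assumption: dimensionality alone decides the group.
    """
    muon_names: List[str] = []
    adamw_names: List[str] = []
    for name, shape in shapes.items():
        if len(shape) == 2:
            muon_names.append(name)
        else:
            adamw_names.append(name)
    return muon_names, adamw_names


# Hypothetical parameter names and shapes, for illustration only.
example_shapes: Dict[str, Shape] = {
    "layers.0.attn.q_proj.weight": (384, 384),
    "layers.0.attn.k_proj.weight": (128, 384),
    "layers.0.mlp.gate_proj.weight": (1536, 384),
    "layers.0.norm.weight": (384,),
    "final_norm.weight": (384,),
}

muon_group, adamw_group = split_parameters(example_shapes)
print("Muon:", muon_group)
print("AdamW:", adamw_group)
```

In a real training loop each group would be handed to its own optimiser instance, and both optimisers would be stepped once per accumulated batch.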
