Abstract Large language models (LLMs) are highly effective across NLP tasks, but their computational cost and architectural complexity put systematic analysis out of reach for researchers with limited compute. In this work, we present MinimalLLM, a small-scale decoder-only Transformer language model whose architecture is similar to that of the Qwen model family. The model aims to balance modern architectural quality with efficiency and transparency, with an emphasis on experimental control and interpretability. Our implementation features Grouped-Query Attention (GQA), rotary position embeddings (RoPE), SwiGLU, and RMSNorm. To support stable convergence under a constrained compute budget, we propose a hybrid optimisation approach that applies Muon to the structured 2D weight matrices and AdamW to all other parameters. MinimalLLM is pre-trained on a 500,000-token streaming subset of the Cosmopedia-v2 synthetic corpus, using mixed-precision training and gradient accumulation on an NVIDIA Tesla T4 GPU. With only 32.15 million parameters, MinimalLLM reaches a validation loss of 0.0945 and a perplexity of 1.10 on the Cosmopedia-v2 domain. These figures should be read as indicating how well the network converged under its specific training conditions, given its over-parametrisation, the Zipfian sparsity of the corpus, and the constrained synthetic domain. Analyses of attention patterns, gradients, and confidence calibration indicate that optimisation remained stable and that the token-level representations are analytically meaningful. We show that state-of-the-art Transformer design principles can be implemented effectively in a small-scale setting, and we describe a reproducible approach to language modelling for research and educational purposes.
Additional analyses, including comparisons with baselines, statistical performance evaluation, tests on other datasets, and ablation studies, support the robustness of our approach.
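The hybrid optimisation approach described above depends on deciding which parameters go to Muon and which to AdamW. The following is a minimal sketch of one way that routing could be done, not the authors' implementation. The function name, the parameter names, and the shapes are hypothetical, and the choice to keep embedding and normalisation weights on AdamW even when they are 2D is an assumption that the abstract does not confirm.

```python
from typing import Dict, List, Tuple


def split_params_for_hybrid_optimizer(
    shapes: Dict[str, Tuple[int, ...]],
    adamw_keywords: Tuple[str, ...] = ("embed", "norm"),
) -> Tuple[List[str], List[str]]:
    """Route parameter names to a Muon group or an AdamW group.

    A parameter goes to the Muon group only if it is a 2D matrix and its
    name contains none of `adamw_keywords` (an assumed exclusion rule).
    Everything else goes to the AdamW group.
    """
    muon_names: List[str] = []
    adamw_names: List[str] = []
    for name, shape in shapes.items():
        is_matrix = len(shape) == 2
        is_excluded = any(keyword in name for keyword in adamw_keywords)
        if is_matrix and not is_excluded:
            muon_names.append(name)
        else:
            adamw_names.append(name)
    return muon_names, adamw_names


# Hypothetical parameter names and shapes for a tiny decoder block.
example_shapes = {
    "tok_embed.weight": (1000, 64),
    "layers.0.attn.q_proj.weight": (64, 64),
    "layers.0.attn.k_proj.weight": (16, 64),
    "layers.0.mlp.gate_proj.weight": (256, 64),
    "layers.0.attn_norm.weight": (64,),
    "final_norm.weight": (64,),
}

muon_names, adamw_names = split_params_for_hybrid_optimizer(example_shapes)
print("Muon group: ", muon_names)
print("AdamW group:", adamw_names)
```

In a real training script the two name lists would select the tensors handed to each optimiser, and both optimisers would be stepped once per accumulated gradient update.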