This technical report investigates whether modern architectural and optimization strategies improve language modeling performance at the GPT-2 (124M) scale under a strict 10B-token, single-pass training budget. Starting from a GPT-2-style configuration, the study introduces Rotary Positional Embeddings (RoPE), the SwiGLU activation, RMSNorm, Query-Key normalization, and a hybrid Muon + AdamW optimization scheme while keeping the parameter count fixed. Every model is trained under this same setting. The report also includes a controlled learning-rate schedule ablation (Warmup–Plateau–Cosine vs. Warmup–Cosine) and analyzes both final metrics and training dynamics. The results suggest that the modernized GPT-2-scale recipe improves validation efficiency under a fixed token budget, while downstream gains remain modest and depend on the schedule. This upload contains the public PDF version of the paper.
Hang Lu studied this question.
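The two schedules compared in the ablation can be sketched roughly as below. This is a minimal illustration, not the report's implementation: the function names, the linear warmup shape, the `min_lr` floor, and every step count are assumptions made for the example, since the abstract gives none of these details.

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def warmup_plateau_cosine(step, total_steps, peak_lr, warmup_steps,
                          plateau_steps, min_lr=0.0):
    """Linear warmup, a constant plateau at peak_lr, then cosine decay (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_start = warmup_steps + plateau_steps
    if step < decay_start:
        return peak_lr  # hold the peak rate before decaying
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical values chosen only to make the difference visible.
    total, peak, warmup, plateau = 1000, 1e-3, 100, 400
    for s in (0, 100, 300, 550, 750, 1000):
        wc = warmup_cosine(s, total, peak, warmup)
        wpc = warmup_plateau_cosine(s, total, peak, warmup, plateau)
        print(f"step {s:4d}  warmup-cosine {wc:.2e}  warmup-plateau-cosine {wpc:.2e}")
```

Under these assumed settings, the plateau variant holds the peak rate for longer and compresses the same cosine decay into fewer steps, which is the kind of difference the schedule ablation would isolate.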