What type of study is this?

This is a quantitative study.

October 13, 2025

Open Access

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Key Points

Low-rank orthogonalization exploits the low-rank nature of gradients to improve matrix optimization during neural network training.
In numerical experiments, low-rank Muon outperforms a carefully tuned vanilla Muon in GPT-2 and LLaMA pretraining.
The paper establishes iteration complexity for both low-rank matrix-signed gradient descent and the low-rank Muon optimizer.
Findings suggest that leveraging low-rank structures may lead to superior training strategies for large-scale neural networks.

Abstract

Neural network (NN) training is inherently a large-scale matrix optimization problem, yet the matrix structure of NN parameters has long been overlooked. Recently, the optimizer Muon (Jordan et al.), which explicitly exploits this structure, has gained significant attention for its strong performance in foundation model training. A key component contributing to Muon's success is matrix orthogonalization. In this paper, we propose low-rank orthogonalization, which explicitly leverages the low-rank nature of gradients during NN training. Building on this, we propose low-rank matrix-signed gradient descent and a low-rank variant of Muon. Our numerical experiments demonstrate the superior performance of low-rank orthogonalization, with the low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining, surpassing the performance of the carefully tuned vanilla Muon. Theoretically, we establish the iteration complexity of the low-rank matrix-signed gradient descent for finding an approximate stationary solution, as well as that of low-rank Muon for finding an approximate stochastic stationary solution under heavy-tailed noise.
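
The abstract does not spell out how low-rank orthogonalization is computed, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that "orthogonalization" means replacing a gradient matrix G = U S V^T by its orthogonal factor U V^T, and that the low-rank variant first compresses G with a Gaussian sketch (a randomized range finder) before orthogonalizing the much smaller projected matrix. The names `msign`, `low_rank_msign`, and `rank` are illustrative and do not come from the paper.

```python
import numpy as np

def msign(G):
    """Orthogonalize G: for the thin SVD G = U S V^T, return U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def low_rank_msign(G, rank, rng):
    """Hypothetical low-rank orthogonalization (assumed, not from the paper).

    1. Sketch the column space of G with a Gaussian test matrix.
    2. Orthonormalize the sketch to obtain a basis Q (m x rank).
    3. Orthogonalize the small matrix Q^T G (rank x n) and lift it back.
    """
    n = G.shape[1]
    Omega = rng.standard_normal((n, rank))  # random test matrix
    Q, _ = np.linalg.qr(G @ Omega)          # orthonormal basis for the sketched range
    B = Q.T @ G                             # small projected matrix
    return Q @ msign(B)                     # m x n matrix of rank at most `rank`

# Usage on a synthetic gradient that is exactly rank 4.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
O = low_rank_msign(G, rank=4, rng=rng)
print(O.shape, np.linalg.matrix_rank(O))
```

The appeal of this design is cost: the SVD is taken of a `rank x n` matrix rather than the full `m x n` gradient. When the gradient is only approximately low-rank, the result is an approximation whose quality depends on how quickly the singular values decay and on the chosen `rank`.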
