What question did this study set out to answer?

The study aims to investigate the effect of music pre-training on language acquisition in Transformer models.

April 25, 2026 | Open Access

Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

Key Points

Pre-trained a Transformer on piano performances from the MAESTRO dataset before training it on language.
Developed a pipeline moving from music to poetry to prose.
Conducted convergence tests with multiple seeds and varying model capacities.
Achieved a 17.5% perplexity improvement over random initialization ($p < 0.001$).
Multi-seed validation (5 seeds, $d = 64$) showed a persistent 5.5% advantage at plateau ($p = 0.017$).
Optimal pre-training data volume shifts with model capacity, offering a $-3\% \to +6\%$ advantage with larger datasets.

Abstract

We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline (music $\to$ poetry $\to$ prose) yields a 17.5% perplexity improvement over random initialization ($p < 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d = 64$, multi-seed validation (5 seeds) shows a persistent 5.5% gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger datasets from $d = 16$ to $d = 64$). Across the scales we study ($d \in \{16, 32, 64\}$, up to 400K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.
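
The abstract's headline numbers are relative perplexity improvements compared across seeds. As a rough illustration of how such a comparison can be computed, the sketch below converts per-seed cross-entropy losses to perplexity, takes the relative improvement, and forms a paired t statistic. The abstract does not say which significance test the authors used, so the paired t-test here is an assumption, and the per-seed losses are invented placeholders, not values from the paper. The function names are hypothetical as well.

```python
import math
from statistics import mean, stdev


def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)


def relative_improvement(baseline_ppl, pipeline_ppl):
    """Fractional perplexity reduction relative to the baseline."""
    return (baseline_ppl - pipeline_ppl) / baseline_ppl


def paired_t_statistic(baseline, pipeline):
    """Paired t statistic over per-seed differences (baseline minus pipeline)."""
    diffs = [b - p for b, p in zip(baseline, pipeline)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Hypothetical per-seed validation losses (nats/token); NOT values from the paper.
baseline_loss = [2.10, 2.08, 2.12, 2.09, 2.11]  # random initialization
pipeline_loss = [1.92, 1.90, 1.95, 1.89, 1.93]  # music -> poetry -> prose

baseline_ppl = [perplexity(x) for x in baseline_loss]
pipeline_ppl = [perplexity(x) for x in pipeline_loss]

improvement = relative_improvement(mean(baseline_ppl), mean(pipeline_ppl))
t_stat = paired_t_statistic(baseline_ppl, pipeline_ppl)

# Two-sided 5% critical value of Student's t with 4 degrees of freedom (5 seeds).
T_CRIT_DF4 = 2.776

print(f"relative perplexity improvement: {improvement:.1%}")
print(f"paired t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed assumes each seed is run under both conditions, so seed-to-seed variance cancels in the differences. If the two conditions used unrelated seeds, an unpaired test would be the more appropriate choice.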
