What question did this study set out to answer?

This research examines how BPE tokenizers fragment compound symbols in symbolic languages, impacting compression efficiency.

February 27, 2026 | Open Access

The Inefficiency of BPE Tokenizers on Symbolic Languages: An Empirical Study and a Simple Fix

Key Points

The study analyzed 43 pairs of symbolic languages using the Qwen 2.5 tokenizer.
It measured token counts to quantify compression efficiency.
It added 26 domain-specific tokens to the tokenizer's vocabulary.
BPE tokenizers were found to fragment each compound symbol into 2-5 sub-tokens.
With the new tokens, mean compression improved from 2.65x to 5.62x.
In a 200K-token context window, this amounts to a gain of 595K effective tokens.
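
The fragmentation effect described in the key points can be illustrated with a toy example. The sketch below is not BPE and does not use the Qwen 2.5 tokenizer: it is a greedy longest-match tokenizer over a tiny vocabulary, and the compound symbols (`->>`, `:=:`), the sample text, and the function name `greedy_tokenize` are all hypothetical, chosen only to show how adding a few whole-symbol tokens reduces the token count.

```python
def greedy_tokenize(text, vocab):
    """Toy stand-in for a subword tokenizer (NOT real BPE).

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical base vocabulary: it knows the words and the individual
# punctuation characters, but not the compound symbols as whole units.
base_vocab = {"load", "store", " ", "x", "y", "-", ">", ":", "="}

# Hypothetical domain-specific tokens added to the vocabulary.
compound_symbols = {"->>", ":=:"}

text = "load x ->> store y :=: x"

before = greedy_tokenize(text, base_vocab)
after = greedy_tokenize(text, base_vocab | compound_symbols)

print(len(before), before)  # each compound symbol is split into 3 pieces
print(len(after), after)    # each compound symbol is a single token
```

In this toy setting each compound symbol costs three tokens before the vocabulary extension and one token after it, so the same text drops from 17 tokens to 13. The study's actual measurements come from a real BPE vocabulary, where the split points depend on learned merges rather than on longest-match lookup.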

Abstract

BPE tokenizers systematically fragment compound symbols from specialized symbolic languages into 2-5 sub-tokens, negating compression gains. We measure this on 43 pairs using the Qwen 2.5 tokenizer (vocabulary of 151,665 tokens). Adding only 26 domain-specific tokens, a 0.017% vocabulary increase, improves mean compression by 112.4% (from 2.65x to 5.62x). Applied to a 200K-token context window, this represents a gain of 595K effective tokens.

Also available in French: L'inefficience des tokenizers BPE sur les langages symboliques
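
The headline figures in the abstract can be roughly checked from the rounded numbers it reports. The sketch below assumes that "effective tokens" means the context window size multiplied by the mean compression ratio; that interpretation, and all variable names, are assumptions rather than definitions taken from the study.

```python
# Figures as reported in the abstract (rounded).
VOCAB_SIZE = 151_665        # Qwen 2.5 tokenizer vocabulary size
ADDED_TOKENS = 26           # domain-specific tokens added
BASE_RATIO = 2.65           # mean compression before the fix
EXTENDED_RATIO = 5.62       # mean compression after the fix
CONTEXT_TOKENS = 200_000    # context window size

# Relative growth of the vocabulary, in percent.
vocab_increase_pct = 100 * ADDED_TOKENS / VOCAB_SIZE

# Assumed definition: effective tokens = context size * compression ratio.
effective_before = CONTEXT_TOKENS * BASE_RATIO
effective_after = CONTEXT_TOKENS * EXTENDED_RATIO
effective_gain = effective_after - effective_before

# Relative improvement computed from the two rounded mean ratios.
relative_improvement_pct = 100 * (EXTENDED_RATIO / BASE_RATIO - 1)

print(f"vocabulary increase: {vocab_increase_pct:.3f}%")
print(f"effective tokens before: {effective_before:,.0f}")
print(f"effective tokens after:  {effective_after:,.0f}")
print(f"gain in effective tokens: {effective_gain:,.0f}")
print(f"relative improvement: {relative_improvement_pct:.1f}%")
```

With these rounded inputs the gain comes out near 594K effective tokens and the relative improvement near 112.1%, slightly below the reported 595K and 112.4%. The gap is presumably explained by rounding of the published ratios or by averaging improvements per pair, but the abstract does not say which.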
