BPE tokenizers systematically fragment compound symbols from specialized symbolic languages into 2-5 sub-tokens, which negates the compression gains of those languages. We measure this on 43 pairs using the Qwen 2.5 tokenizer (151,665 tokens). Adding only 26 domain-specific tokens, a 0.017% vocabulary increase, improves mean compression by 112.4% (from 2.65x to 5.62x). Applied to a 200K-token context window, this corresponds to a gain of roughly 595K effective tokens.

Also available in French: "L'inefficience des tokenizers BPE sur les langages symboliques".
Cros et al. studied this question.
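The headline figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the inputs are the rounded values quoted in the abstract, not the underlying per-pair measurements. With these rounded inputs the relative improvement comes out near 112.1% and the context gain near 594K tokens, slightly below the reported 112.4% and 595K. The gap is presumably due to rounding of the two means or to averaging per pair, though the source does not say which.

```python
# Rounded figures quoted in the abstract (names are illustrative, not from the source).
BASE_VOCAB = 151_665      # Qwen 2.5 tokenizer vocabulary size
ADDED_TOKENS = 26         # domain-specific tokens added
BASELINE_RATIO = 2.65     # mean compression before the extension
EXTENDED_RATIO = 5.62     # mean compression after the extension
CONTEXT_WINDOW = 200_000  # context window size in tokens


def vocab_increase_pct(base: int, added: int) -> float:
    """Vocabulary growth as a percentage of the base vocabulary."""
    return 100.0 * added / base


def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement of a ratio, in percent."""
    return 100.0 * (after - before) / before


def effective_token_gain(window: int, before: float, after: float) -> float:
    """Extra effective tokens that fit in a window when compression improves."""
    return window * (after - before)


if __name__ == "__main__":
    print(f"vocabulary increase: {vocab_increase_pct(BASE_VOCAB, ADDED_TOKENS):.3f}%")
    print(f"relative gain:       {relative_gain_pct(BASELINE_RATIO, EXTENDED_RATIO):.1f}%")
    gain = effective_token_gain(CONTEXT_WINDOW, BASELINE_RATIO, EXTENDED_RATIO)
    print(f"effective gain:      {gain:,.0f} tokens")
```

Here "effective tokens" is taken to mean the window size multiplied by the mean compression ratio, which is an assumption about how the abstract's figure was derived.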