Neural machine translation (NMT) performance depends strongly on tokenization, particularly for morphologically rich languages such as Arabic. Yet controlled, reproducible studies of tokenization under low-resource conditions remain scarce, which limits our understanding of how different methods affect translation quality and training dynamics. This paper presents a controlled experimental study of tokenization methods for English → Arabic (EN → AR) translation using a Tiny Transformer model under low-resource conditions. To isolate the effect of tokenization and ensure reproducibility, all experiments use an identical architecture, number of training steps, decoding procedure, and evaluation pipeline. Translation quality is assessed using multiple metrics, including BLEU, ChrF++, TER, and BERTScore. These metrics diverge substantially, demonstrating empirically that BLEU alone is insufficient for reliable evaluation of low-resource Arabic NMT. Although the limitations of BLEU are known in general, our results provide new evidence that, under low-resource conditions and across tokenization strategies, relying on BLEU alone can lead to misleading conclusions about translation quality. Training dynamics are analyzed using TensorBoard, linking tokenization strategies to differences in convergence, saturation, and stability. For validation, a small-scale English → German (EN → DE) experiment confirms that the Tiny Transformer setup reproduces expected behavior. The contribution of this work is controlled empirical evidence and practical insight for low-resource Arabic NMT, rather than absolute performance gains.
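To make the difference in tokenization granularity concrete, the following is a minimal sketch of byte-pair encoding on a toy word list. It is not the tokenization pipeline used in this study. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the number of merges, and the toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (illustrative only): Arabic word forms that share the root k-t-b.
corpus = ["كتاب", "الكتاب", "كتابة", "كاتب", "والكتاب"]
merges = learn_bpe(corpus, num_merges=4)

word = "الكتاب"
print("character-level:", list(word))
print("BPE (4 merges): ", segment(word, merges))
```

With zero merges, `segment` falls back to character-level tokenization, so the number of merges acts as a single knob between character-level and coarser subword segmentations.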
Our results provide controlled evidence that tokenization choice critically affects both translation quality and optimization dynamics, offering practical guidance for low-resource Arabic NMT research. Overall, byte-pair encoding (BPE) achieves the strongest balance across surface-level and semantic metrics under controlled low-resource conditions (BLEU: 8.57, ChrF++: 18.56, TER: 97.38, BERTScore-F1: 0.785). Character-level tokenization yields higher semantic similarity than the subword-based methods, as measured by BERTScore, but is weaker in structural fidelity and surface-form accuracy. SentencePiece shows intermediate behavior, favoring semantic adequacy over exact n-gram matching. Taken together, these findings indicate that BLEU alone is insufficient for assessing Arabic translation quality.
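One reason word-level and character-level metrics can diverge for a morphologically rich target language is that a hypothesis may differ from the reference only in affixes. The sketch below illustrates this with two deliberately simplified scores: a clipped word n-gram precision (the core quantity inside BLEU) and a single-order character n-gram F-score (in the spirit of ChrF). These are not the full BLEU or ChrF++ definitions and not the evaluation code used in this study. The function names, the choice of n = 3 and β = 2, and the sentence pair are assumptions made up for illustration.

```python
from collections import Counter


def ngrams(seq, n):
    """Return a multiset of the n-grams in a sequence (string or list)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def overlap(hyp, ref, n):
    """Clipped n-gram matches, plus hypothesis and reference n-gram totals."""
    h, r = ngrams(hyp, n), ngrams(ref, n)
    matched = sum((h & r).values())
    return matched, sum(h.values()), sum(r.values())


def word_precision(hyp, ref, n=1):
    """Word-level clipped n-gram precision for a single sentence pair."""
    matched, total_h, _ = overlap(hyp.split(), ref.split(), n)
    return matched / total_h if total_h else 0.0


def char_fscore(hyp, ref, n=3, beta=2.0):
    """Character n-gram F-beta for one order n, ignoring spaces (simplified)."""
    matched, total_h, total_r = overlap(hyp.replace(" ", ""), ref.replace(" ", ""), n)
    if matched == 0:
        return 0.0
    p, r = matched / total_h, matched / total_r
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# Made-up sentence pair: every word differs from the reference only by an affix.
ref = "الطلاب يقرأون الكتاب"
hyp = "والطلاب يقرأ كتاب"

print("word unigram precision:", word_precision(hyp, ref))
print("char trigram F-score:  ", round(char_fscore(hyp, ref), 3))
```

In this pair no whole word matches, so the word-level precision is zero, while the character-level score is positive because the stems overlap. This is only a toy illustration of the mechanism, not a substitute for the reported metric values.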
Alrashidi et al. (Sat,) studied this question.