No existing AI safety benchmark measures whether a large language model maintains Arabic-only output constraints under sustained multi-turn pressure. This paper introduces the first Arabic-language constraint persistence evaluation and reports results from three frontier models tested against twenty probes across five constraint subtypes relevant to Gulf sovereign AI deployment. Scores vary widely across models: one model achieves 0.85, one narrowly clears the minimum deployment threshold of 0.7 at 0.7143, and one falls below the threshold at 0.6471. The findings establish that Arabic-language constraint persistence is a measurable property that differs materially across models and is not predictable from general capability benchmarks.
Ahmad A (Sat,) studied this question.
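
The threshold comparison reported above can be illustrated with a minimal sketch. It uses only the three scores and the 0.7 threshold stated in the abstract. The model labels (`model_a`, `model_b`, `model_c`) are hypothetical placeholders, since the abstract does not name the models, and treating a score equal to the threshold as passing is an assumption. How each score is derived from the probes is not specified here, so the sketch does not attempt to reproduce it.

```python
# Minimal sketch: compare the reported scores against the deployment threshold.
# Model labels are hypothetical placeholders; the abstract does not name the models.
DEPLOYMENT_THRESHOLD = 0.7  # minimum deployment threshold stated in the abstract

reported_scores = {
    "model_a": 0.85,
    "model_b": 0.7143,
    "model_c": 0.6471,
}


def summarize(scores, threshold=DEPLOYMENT_THRESHOLD):
    """Return pass/fail status and margin to the threshold for each model."""
    summary = {}
    for name, score in scores.items():
        summary[name] = {
            "score": score,
            # Assumption: a score equal to the threshold counts as passing.
            "passes": score >= threshold,
            "margin": round(score - threshold, 4),
        }
    return summary


def score_spread(scores):
    """Difference between the highest and lowest reported score."""
    values = list(scores.values())
    return round(max(values) - min(values), 4)


if __name__ == "__main__":
    for name, row in summarize(reported_scores).items():
        status = "pass" if row["passes"] else "fail"
        print(f"{name}: score={row['score']:.4f} margin={row['margin']:+.4f} {status}")
    print(f"spread={score_spread(reported_scores):.4f}")
```

Reporting the margin alongside the pass/fail flag makes it visible that the second model clears the threshold only narrowly, which a binary status alone would hide.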