February 14, 2026 | Open Access

Fine-Tuning Arabic Large Language Models for improved multi-turn dialogue: A blueprint for synthetic data generation and benchmarking

Key Points

The research aims to enhance Arabic conversational systems by creating a high-quality dataset for multi-turn dialogues.
Constructed a dataset with 43,316 multi-turn conversations across 93 topics.
Fine-tuned two Arabic LLMs (ArabianGPT-08B-V2 and AraGPT2-mega) on the synthetic data.
Benchmarking involved automatic metrics (Perplexity and RAVEN) and human evaluations.
Fine-tuned ArabianGPT-08B-V2 achieved the highest RAVEN score (0.823), outperforming fine-tuned AraGPT2-mega and the instruction-tuned baselines.
It did so while maintaining a strong within-model perplexity of 9.4.
Human evaluation produced overall quality scores between 4.04 and 4.34 on a five-point scale.

Abstract

The rapid evolution of Large Language Models (LLMs) has fueled increasing interest in developing Arabic conversational systems capable of sustaining coherent multi-turn dialogues. However, progress remains constrained by the scarcity of large-scale, diverse, and high-quality datasets specifically designed for Arabic multi-turn interaction. This study presents a reproducible methodology for constructing such a dataset through structured prompting of an instruction-tuned Arabic LLM (Jais-13b-chat), yielding 43,316 multi-turn conversations across 93 topics and 151 countries. Two pre-trained Arabic language models (ArabianGPT-08B-V2 and AraGPT2-mega) were fine-tuned on this synthetic data and benchmarked against multilingual instruction-tuned baselines using a comprehensive evaluation framework combining automatic metrics (Perplexity and RAVEN) with structured human evaluation. Fine-tuned ArabianGPT-08B-V2 achieved the highest RAVEN score (0.823) for cross-model comparison, outperforming both fine-tuned AraGPT2-mega and instruction-tuned baselines while maintaining strong within-model perplexity (9.4). Human evaluation by two independent raters demonstrated acceptable inter-rater reliability (Cohen’s κ = 0.229–0.739) with positive rank correlations (Spearman ρ = 0.424–0.759), yielding overall quality scores of 4.04–4.34 on a five-point scale. These findings demonstrate that high-quality, LLM-generated synthetic data effectively improves Arabic conversational models, providing a scalable, resource-efficient blueprint for dialogue systems in low-resource and culturally specific settings.
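As a rough illustration of two of the statistics named in the abstract, the following minimal sketch computes perplexity from per-token log-probabilities and unweighted Cohen's κ for two raters. It is not the authors' evaluation code: the function names, the sample ratings, and the choice of unweighted κ are assumptions made for illustration only, and RAVEN is not covered.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumes the raters do not agree purely by construction
    (expected agreement < 1), otherwise the denominator is zero.
    """
    n = len(rater_a)
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical inputs, not data from the study.
sample_log_probs = [math.log(0.25)] * 4   # every token predicted with p = 0.25
scores_rater_1 = [5, 4, 4, 3, 5, 4]       # five-point quality ratings
scores_rater_2 = [5, 4, 3, 3, 5, 5]

print(f"perplexity: {perplexity(sample_log_probs):.3f}")
print(f"kappa: {cohens_kappa(scores_rater_1, scores_rater_2):.3f}")
```

Lower perplexity means the model assigns higher probability to the reference text, while κ corrects raw rater agreement for the agreement expected by chance.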

Authors

Ahmed Mahmoud Misbah

Arab Academy for Science, Technology, and Maritime Transport

Mohamed Farouk

Arab Academy for Science, Technology, and Maritime Transport

Mustafa AbdulAzim

Arab Academy for Science, Technology, and Maritime Transport

Journals

PLoS ONE
