March 3, 2026 · Open Access

Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Key Points

Machine translation and topic classification performance improves when the Piedmontese dataset is used, which highlights its utility.
The evaluation dataset covers diverse linguistic features, which helps clarify how models handle non-standard orthography.
The analysis builds on the FLORES+ and SIB-200 benchmarks to assess language models' capabilities across languages.
Crowdsourcing enriches the dataset, but non-standard forms require further exploration.

Abstract

This dataset provides data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and on "SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects" (Adelani et al., EACL 2024).
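
As a rough illustration of the topic classification use case, the sketch below scores a model's predicted topic labels against gold labels using plain accuracy. The function name, the label strings, and the toy lists are hypothetical placeholders, not values taken from the dataset, and accuracy is assumed here as the metric for illustration only.

```python
def accuracy(gold, predicted):
    """Return the fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy labels, for illustration only (not drawn from the dataset).
gold = ["health", "sports", "travel", "health"]
predicted = ["health", "sports", "health", "health"]

print(accuracy(gold, predicted))  # 0.75
```

In practice the gold labels would come from the dataset's topic annotations and the predictions from the language model under test.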
