March 3, 2026 | Open Access

Robust End-to-End Spoken Language Understanding in Low-Resource Settings

Key Points

A speech-conditioned large language model (Phi-4) performs instruction-driven SLU in Galician, a language entirely unseen during pretraining.
Increasing model scale and applying language-specific ASR pretraining consistently improve performance across all SLU metrics, particularly under the challenging conditions of the FalAI dataset.
Adapting only the speech encoder while keeping the decoder frozen yields competitive performance on instruction-driven SLU tasks.
An ablation study shows that lexical diversity is crucial for robust generalisation, whereas acoustic redundancy has limited impact beyond a certain threshold.

Abstract

This paper explores the robustness and generalisation capabilities of foundation speech models, such as Whisper, and Large Language Models (LLMs) with speech encoders, like Phi-4, when fine-tuned for the End-to-End (E2E) Spoken Language Understanding (SLU) task in low-resource languages. We investigate the impact of model scale, language-specific pretraining, and specialised encoding strategies for both Whisper-based and LLM-based SLU, while also exploring prompting techniques that enable LLMs to handle tasks such as intent classification and slot filling, bridging the gap between raw audio and language understanding. Our results demonstrate that a speech-conditioned LLM can perform instruction-driven SLU in a language entirely unseen during pretraining. By adapting only the speech encoder of Phi-4 and keeping the decoder frozen, the system achieves competitive performance in Galician through in-context supervision. In parallel, we show that increasing model scale and applying language-specific ASR pretraining consistently boost performance across all SLU metrics, particularly under the challenging conditions defined in the FalAI dataset. Finally, we present a detailed ablation study, showing that while acoustic redundancy has limited impact beyond a certain threshold, lexical diversity plays a crucial role in supporting robust generalisation. These findings offer new insights into data efficiency and generalisation in low-resource settings.
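
To make the contrast between acoustic redundancy and lexical diversity concrete, the sketch below computes a type-token ratio over a set of transcripts. This is an illustrative assumption only: the abstract does not say how lexical diversity is measured, and the function name `type_token_ratio`, the whitespace tokenisation, and the example utterances are all hypothetical rather than taken from the study or the FalAI dataset.

```python
def type_token_ratio(utterances):
    """Return distinct word types divided by total word tokens.

    Tokenisation is a plain lowercase whitespace split, which is an
    assumption made for illustration, not the paper's method.
    """
    tokens = [tok for utt in utterances for tok in utt.lower().split()]
    if not tokens:
        return 0.0  # no tokens: define the ratio as zero
    return len(set(tokens)) / len(tokens)


# Hypothetical transcripts: the same sentence repeated (as if recorded by
# several speakers) versus differently worded commands.
redundant = ["turn on the light", "turn on the light", "turn on the light"]
diverse = ["turn on the light", "switch the lamp off", "dim the kitchen bulbs"]

print(type_token_ratio(redundant))
print(type_token_ratio(diverse))
```

Under this measure, adding more recordings of the same sentence increases the amount of audio without adding any new word types, whereas differently worded utterances raise the ratio. That is one plausible way to read the distinction the ablation study draws.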
