Background: Extended Reality (XR) technologies offer transformative potential for language education, yet current platforms largely neglect the accessibility needs of deaf and hard-of-hearing individuals. Existing solutions typically operate in single-language environments and lack integrated support for sign language and multimodal communication. There is a critical need for inclusive platforms that serve both deaf and hearing learners through cross-modal AI services embedded in immersive environments.

Methods: This study presents a modular platform integrating six AI services: speech-to-text transcription (OpenAI Whisper), multilingual translation (Meta NLLB), text-to-speech synthesis (AWS Polly), sentiment analysis (RoBERTa), session summarisation (flan-t5-base-samsum), and International Sign (IS) translation via Google MediaPipe. An IS dataset of 750 gesture videos was processed to extract hand landmark coordinates, which were mapped to 3D avatar animations within a Unity-based VR environment on Meta Quest 3 headsets. The system was validated through technical benchmarking of AI service performance, including comparative evaluation of text-to-speech (TTS) services and multilingual translation models (NLLB-200 and EuroLLM 1.7B variants), load testing to assess platform scalability, and end-to-end pipeline latency measurement for both the hearing and the deaf user pathways. The educational scenario was additionally evaluated in a companion pilot study [50], which shares the same underlying AI services and provides complementary user-perception evidence.

Results: Technical benchmarking confirmed the platform’s viability for real-time XR deployment. In the TTS comparison, AWS Polly had the lowest latency (50–100 ms to first byte) at competitive cost. The EuroLLM 1.7B Instruct model achieved a BLEU score of 84.34, outperforming NLLB’s 79.25. Load testing with 1,000 simulated concurrent users demonstrated average response times under 800 ms with no critical failures. Avatar animation latency for IS sign rendering remained consistently under 300 ms. End-to-end pipeline latency averaged 2.05 ± 0.31 s for the hearing pathway and 2.32 ± 0.34 s for the deaf (IS) pathway, both within accepted thresholds for conversational educational use. The companion pilot (N = 10) reported a mean overall experience rating of 4.6/5.0, 92% user satisfaction, and unanimous (100%) demand for expanded language and sign-language support [50].

Conclusions: The results presented in this study focus on the technical feasibility of integrating cross-modal AI services within XR environments for accessible, multilingual language learning. The modular architecture enables independent scaling and adaptation to diverse contexts, laying the groundwork for equitable educational solutions aligned with EU digital accessibility objectives.
Tantaroudas et al. (Sat,) studied this question.
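
As an illustration of how end-to-end latency figures of the kind reported above (mean ± standard deviation per pathway) might be collected, the following minimal Python sketch chains placeholder stages and times each run. It is not the authors' implementation: the stage functions, their order, and the names `run_pipeline`, `summarise`, `HEARING_PATHWAY` and `DEAF_PATHWAY` are hypothetical stand-ins, and the real services (Whisper, NLLB, Polly, the IS avatar renderer) are replaced by trivial string functions.

```python
import statistics
import time
from typing import Callable, Sequence, Tuple

# A stage takes a payload and returns a transformed payload.
Stage = Callable[[str], str]


def run_pipeline(stages: Sequence[Stage], payload: str) -> Tuple[str, float]:
    """Pass the payload through each stage in order.

    Returns the final output and the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for stage in stages:
        payload = stage(payload)
    return payload, time.perf_counter() - start


def summarise(latencies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the measured latencies."""
    return statistics.mean(latencies), statistics.stdev(latencies)


# Hypothetical placeholders for the real AI services (assumptions, not real APIs).
def fake_transcribe(audio: str) -> str:
    return f"text({audio})"


def fake_translate(text: str) -> str:
    return f"translated({text})"


def fake_synthesise(text: str) -> str:
    return f"audio({text})"


def fake_sign_avatar(text: str) -> str:
    return f"signs({text})"


# Assumed stage order for each pathway; the source does not specify it.
HEARING_PATHWAY = [fake_transcribe, fake_translate, fake_synthesise]
DEAF_PATHWAY = [fake_transcribe, fake_translate, fake_sign_avatar]

if __name__ == "__main__":
    for name, pathway in (("hearing", HEARING_PATHWAY), ("deaf (IS)", DEAF_PATHWAY)):
        # Repeat the run several times so a standard deviation can be computed.
        samples = [run_pipeline(pathway, "utterance")[1] for _ in range(20)]
        mean, sd = summarise(samples)
        print(f"{name}: {mean:.6f} ± {sd:.6f} s over {len(samples)} runs")
```

Because each stage is an independent callable, a service could be swapped or scaled without touching the timing harness, which loosely mirrors the modular design described in the abstract.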