August 17, 2025 | Open Access

Real-Time Speech-to-Text on Edge: A Prototype System for Ultra-Low Latency Communication with AI-Powered NLP

Key Points

The prototype achieves sub-second end-to-end latency under realistic conditions.
Audio is captured in the browser through WebRTC and streamed in real time over WebSocket.
Automatic speech recognition (ASR) runs offline on the Vosk engine, so the system keeps working in bandwidth-limited or disconnected settings.
The modular architecture supports the integration of advanced AI models and domain-specific adaptations.

Abstract

This paper presents a real-time speech-to-text (STT) system designed for edge computing environments that require ultra-low latency and local processing. Unlike cloud-based STT services, the proposed solution runs entirely on local infrastructure, which preserves user privacy and maintains high performance in bandwidth-limited or offline scenarios. The system combines browser-native audio capture through WebRTC, real-time streaming over WebSocket, and offline automatic speech recognition (ASR) with the Vosk engine. A natural language processing (NLP) component, implemented as a microservice, post-processes the transcripts to improve spelling accuracy and clarity. Our prototype reaches sub-second end-to-end latency and strong transcription performance under realistic conditions. Furthermore, the modular architecture allows for extensibility, the integration of advanced AI models, and domain-specific adaptations.
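
To make the server side of this pipeline concrete, the following is a minimal sketch of a chunked recognition loop; it is not the authors' implementation. It assumes audio arrives as small PCM chunks (for example, 3200 bytes is 100 ms of 16 kHz, 16-bit mono audio) and relies only on the `AcceptWaveform`, `PartialResult`, `Result`, and `FinalResult` methods of Vosk's `KaldiRecognizer`. The names `transcribe_stream`, `make_vosk_recognizer`, and `ScriptedRecognizer` are hypothetical, and the `postprocess` callable is a local stand-in for the NLP microservice. WebRTC capture and WebSocket transport are left out.

```python
import json
from typing import Callable, Iterable, Iterator, Protocol, Tuple


class StreamingRecognizer(Protocol):
    """The subset of vosk.KaldiRecognizer's interface used by this sketch."""

    def AcceptWaveform(self, data: bytes) -> int: ...
    def PartialResult(self) -> str: ...
    def Result(self) -> str: ...
    def FinalResult(self) -> str: ...


def transcribe_stream(
    chunks: Iterable[bytes],
    recognizer: StreamingRecognizer,
    postprocess: Callable[[str], str] = str.strip,
) -> Iterator[Tuple[str, str]]:
    """Yield ("partial", text) and ("final", text) events as audio arrives.

    `postprocess` is an assumed stand-in for the NLP clean-up step and is
    applied to finalized utterances only, so partial results stay cheap.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):  # truthy when an utterance ended
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield "final", postprocess(text)
        else:
            partial = json.loads(recognizer.PartialResult()).get("partial", "")
            if partial:
                yield "partial", partial
    # Flush whatever the recognizer still buffers once the stream closes.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield "final", postprocess(text)


def make_vosk_recognizer(model_dir: str, sample_rate: int = 16000):
    """Build a real recognizer; needs the third-party `vosk` package and a
    downloaded model directory (the path is supplied by the caller)."""
    from vosk import KaldiRecognizer, Model  # imported lazily on purpose

    return KaldiRecognizer(Model(model_dir), sample_rate)


class ScriptedRecognizer:
    """Hypothetical test double: 'hears' one scripted word per audio chunk."""

    def __init__(self, words, utterance_len=2):
        self._words = list(words)
        self._heard = []
        self._utterance_len = utterance_len

    def AcceptWaveform(self, data: bytes) -> int:
        if self._words:
            self._heard.append(self._words.pop(0))
        return int(len(self._heard) >= self._utterance_len)

    def PartialResult(self) -> str:
        return json.dumps({"partial": " ".join(self._heard)})

    def Result(self) -> str:
        text, self._heard = " ".join(self._heard), []
        return json.dumps({"text": text})

    FinalResult = Result
```

In a deployment along the lines the abstract describes, `chunks` would be fed by the WebSocket handler, and each yielded event would be sent back to the browser. Partial events give the user immediate feedback, while only final events pay the cost of the clean-up step. `ScriptedRecognizer` exists only so the loop can be exercised without a model; `make_vosk_recognizer` would replace it in practice.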
