The rapid advancement of artificial intelligence and speech processing technologies has significantly enhanced human-computer interaction. However, traditional voice cloning and text-to-speech systems often rely on high-cost infrastructure and generate complete audio before playback, leading to increased latency. This paper presents a Real-Time Voice Cloning and Streaming System designed to generate and stream speech simultaneously with minimal delay. The system operates efficiently on standard personal computers and processes text along with a reference voice sample to produce speech incrementally. The proposed system integrates advanced speech synthesis models, voice encoding techniques, and a low-latency streaming pipeline using WebSocket-based communication. This enables continuous and smooth audio playback without pre-generating the entire audio. The system offers reduced latency, improved efficiency, and enhanced real-time interaction capabilities. It is suitable for applications such as virtual assistants, conversational agents, accessibility tools, and interactive platforms.

Keywords: Voice Cloning, Real-Time Streaming, Text-to-Speech, Low Latency, AI.
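The incremental generate-and-stream idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `synthesize_chunk` is a hypothetical stand-in that emits a short tone instead of running a speech model, the 16 kHz sample rate and comma-based phrase splitting are assumptions, and an in-process `asyncio.Queue` takes the place of the WebSocket connection. The point is only that the consumer receives each audio chunk as soon as it is produced, rather than waiting for the whole utterance.

```python
import asyncio
import math
import struct

SAMPLE_RATE = 16_000  # assumed sample rate in Hz; the paper does not state one


def synthesize_chunk(phrase: str) -> bytes:
    """Hypothetical stand-in for the synthesis model.

    Produces 10 ms of 16-bit mono PCM tone per character of input text.
    """
    n_samples = len(phrase) * SAMPLE_RATE // 100
    samples = (
        int(8000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE))
        for i in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


async def producer(text: str, queue: asyncio.Queue) -> None:
    """Synthesize one phrase at a time and hand each chunk off immediately."""
    for phrase in text.split(","):  # assumed, simplistic phrase splitting
        phrase = phrase.strip()
        if phrase:
            await queue.put(synthesize_chunk(phrase))
            await asyncio.sleep(0)  # yield so the consumer can run
    await queue.put(None)  # sentinel: end of stream


async def consumer(queue: asyncio.Queue) -> list:
    """Stand-in for the playback client: records the size of each chunk."""
    sizes = []
    while (chunk := await queue.get()) is not None:
        sizes.append(len(chunk))  # a real client would play the chunk here
    return sizes


async def stream(text: str) -> list:
    # A small bounded queue keeps the producer from running far ahead.
    queue = asyncio.Queue(maxsize=2)
    _, sizes = await asyncio.gather(producer(text, queue), consumer(queue))
    return sizes


def run_demo(text: str) -> list:
    return asyncio.run(stream(text))


if __name__ == "__main__":
    print(run_demo("Hello, world"))
```

In a deployment along the lines the abstract describes, the queue would presumably be replaced by a WebSocket send on the server and a receive-and-play loop on the client; the chunk-by-chunk hand-off is the part this sketch is meant to show.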
Masroor Hussain
Lalithaditya S
S Kranthi Varma
Aditya Birla (India)
Hussain et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69db37044fe01fead37c4f9c — DOI: https://doi.org/10.5281/zenodo.19494307