What question did this study set out to answer?

This work aims to improve the intelligibility of speech synthesis from EMG signals by integrating voice conversion techniques and cross-speaker training.

February 20, 2026 | Open Access

Personalizing Myoelectric Silent Speech Interfaces via Cross-Speaker Training and Voice Timbre Control

Key Points

Investigate EMG-to-Speech conversion using voice-adaptive models.
Employ voice conversion to disentangle phonetic content from voice timbre.
Utilize cross-speaker training to enhance model performance with limited data.
Implement an end-to-end model for low-latency speech generation.
Voice-adaptive models significantly enhance speech intelligibility.
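Cross-speaker training improves average speech synthesis intelligibility and removes the need to train speaker-specific models.
The proposed end-to-end model generates speech with less than 20 ms of algorithmic latency and outperforms previous low-latency baselines in intelligibility and naturalness.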

Abstract

Electromyography (EMG) signals, which measure muscle activity, are investigated for Silent Speech Interfaces (SSIs) that enable speech communication via silent articulation. The previous paradigm for EMG-to-Speech conversion relies on speaker-dependent models that predict acoustic speech features of the same speaker who provides the EMG inputs. This approach limits SSI applications because it 1) cannot synthesize the personal voice of individuals who are unable to produce audible speech during EMG recording, 2) suffers from data scarcity, since each speaker must record a sizable corpus, and 3) yields unintelligible speech in low-latency settings. Problem (1), converting EMG signals to speech in personal voices, is addressed with voice conversion methods that disentangle phonetic and voice timbre information. The proposed voice-adaptive EMG-to-Speech models predict speech content features, which mostly reflect phonetic content, from EMG signals and combine them with reference audio of the target voice for speech synthesis. Further evaluations demonstrate that such models can be trained using EMG signals of silent speech only. Problem (2), data scarcity, is addressed in several studies in which EMG models are pre-trained with other biosignals, unlabeled EMG signals, and labeled EMG signals of multiple speakers (cross-speaker training). In particular, cross-speaker training improves average speech synthesis intelligibility while eliminating the need to train speaker-specific models. For problem (3), EMG-to-Speech in low-latency settings, this work presents an end-to-end model that outperforms previous low-latency baselines in speech intelligibility and naturalness while generating speech with less than 20 ms of algorithmic latency. Finally, combining these contributions, this work introduces a unified model that can convert EMG signals of multiple speakers to selectable voices.
