Electromyography (EMG) signals, which measure muscle activity, are investigated for Silent Speech Interfaces (SSIs) to enable speech communication through silent articulation. The previous paradigm for EMG-to-Speech conversion relies on speaker-dependent models that predict acoustic speech features of the same speaker who provides the EMG input. This approach limits SSI applications because it 1) cannot synthesize the personal voice of individuals who are unable to produce audible speech during EMG recording, 2) suffers from data scarcity, since each speaker must record a sizable corpus, and 3) yields unintelligible speech in low-latency settings. Problem (1), converting EMG signals to speech in a personal voice, is addressed with voice conversion methods that disentangle phonetic and voice timbre information. The proposed voice-adaptive EMG-to-Speech models predict speech content features, which mostly reflect phonetic content, from EMG signals and combine them with reference audio of the target voice for speech synthesis. Further evaluations demonstrate that such models can be trained using only EMG signals of silent speech. The data scarcity problem (2) is addressed in several studies in which EMG models are pre-trained with other biosignals, unlabeled EMG signals, and labeled EMG signals of multiple speakers, i.e., cross-speaker training. In particular, cross-speaker training improves average speech synthesis intelligibility while eliminating the need to train speaker-specific models. To improve EMG-to-Speech conversion in low-latency settings (3), this work presents an end-to-end model that outperforms previous low-latency baselines in speech intelligibility and naturalness while generating speech with less than 20 ms of algorithmic latency. Finally, combining these contributions, this work introduces a unified model that converts EMG signals of multiple speakers to selectable voices.
Kevin Scheck studied this question.
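To make the "less than 20 ms algorithmic latency" figure concrete, the following is a minimal sketch of how algorithmic latency is commonly computed for a chunk-based streaming model: the time spent buffering one input chunk plus any lookahead, with compute time excluded. The class `StreamingConfig`, its fields, and all numeric values are hypothetical and chosen only for illustration; the abstract does not state the sampling rate, chunk size, or lookahead actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingConfig:
    """Hypothetical streaming setup; none of these values come from the work itself."""

    sample_rate_hz: int     # sampling rate of the stream
    chunk_samples: int      # samples buffered before each model step
    lookahead_samples: int  # future samples the model needs beyond the chunk

    def algorithmic_latency_ms(self) -> float:
        # Buffering delay plus lookahead; processing time is deliberately excluded.
        total_samples = self.chunk_samples + self.lookahead_samples
        return 1000.0 * total_samples / self.sample_rate_hz


# Assumed example values: 16 kHz stream, 10 ms chunks.
causal = StreamingConfig(sample_rate_hz=16_000, chunk_samples=160, lookahead_samples=0)
with_lookahead = StreamingConfig(sample_rate_hz=16_000, chunk_samples=160, lookahead_samples=320)

print(causal.algorithmic_latency_ms())          # 10.0
print(with_lookahead.algorithmic_latency_ms())  # 30.0
```

Under these assumed values, the fully causal configuration stays below a 20 ms budget, while adding 20 ms of lookahead exceeds it. This illustrates why low-latency models can use little or no future context.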