What question did this study set out to answer?

This study asks whether high-quality face images that preserve a speaker's identity can be generated from speech alone by aligning speech embeddings to an explicit identity space.

February 17, 2026 | Open Access

Vox2Face: Speech-Driven Face Generation via Identity-Space Alignment and Diffusion Self-Consistency

Key Points

Uses a pretrained speaker encoder to extract speech embeddings.
Recasts speech-to-image mapping as a geometric alignment of speech embeddings to the ArcFace identity space.
Applies a discriminator-free diffusion self-consistency loss to refine the speech-to-identity mapping.
Improves ArcFace cosine similarity from 0.295 to 0.322 over a strong diffusion baseline on HQ-VoxCeleb.
Raises R@10 retrieval accuracy from 29.8% to 32.1%.
Raises the VGGFace Score from 18.82 to 23.21.
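
The two identity metrics above can be illustrated with a small sketch. It assumes that ArcFace cosine similarity is the cosine between L2-normalized embeddings and that R@10 counts a query as a hit when its true identity appears among the 10 most similar gallery entries. The function names and the gallery setup are hypothetical; the paper's exact evaluation protocol is not described on this page.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(query_embs, gallery_embs, true_indices, k=10):
    """Fraction of queries whose true gallery index ranks in the top k.

    Assumption: one correct gallery entry per query, ranked by cosine
    similarity. This is a common retrieval setup, not necessarily the
    protocol used in the paper.
    """
    hits = 0
    for query, true_idx in zip(query_embs, true_indices):
        sims = [cosine_similarity(query, g) for g in gallery_embs]
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(query_embs)
```

In this reading, the query embeddings would come from generated faces and the gallery from real face images, so a higher R@10 means the generated face more often retrieves the correct speaker.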

Abstract

Speech-driven face generation aims to synthesize a face image that matches a speaker’s identity from speech alone. However, existing methods typically trade identity fidelity for visual quality and rely on large end-to-end generators that are difficult to train and tune. We propose Vox2Face, a speech-driven face generation framework centered on an explicit identity space rather than direct speech-to-image mapping. A pretrained speaker encoder first extracts speech embeddings, which are distilled and metric-aligned to the ArcFace hyperspherical identity space, transforming cross-modal regression into a geometrically interpretable speech-to-identity alignment problem. On this unified identity representation, we reuse an identity-conditioned diffusion model as the generative backbone and synthesize diverse, high-resolution faces in the Stable Diffusion latent space. To better exploit this prior, we introduce a discriminator-free diffusion self-consistency loss that treats denoising residuals as an implicit critique of speech-predicted identity embeddings and updates only the speech-to-identity mapping and lightweight LoRA adapters, encouraging speech-derived identities to lie on the high-probability identity manifold of the diffusion model. Experiments on the HQ-VoxCeleb dataset show that Vox2Face improves the ArcFace cosine similarity from 0.295 to 0.322, boosts R@10 retrieval accuracy from 29.8% to 32.1%, and raises the VGGFace Score from 18.82 to 23.21 over a strong diffusion baseline. These results indicate that aligning speech to a unified identity space and reusing a strong identity-conditioned diffusion prior is an effective way to jointly improve identity fidelity and visual quality.
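
The two training signals described in the abstract can be sketched in toy form. This is a minimal illustration under stated assumptions, not the authors' implementation: `speech_to_identity` is a hypothetical linear mapping head, the alignment term is assumed to be a cosine distance on the unit hypersphere, and the self-consistency term is assumed to be the standard noise-prediction residual of a frozen, identity-conditioned denoiser (passed in here as a hypothetical callable `denoiser`). The abstract does not give the exact loss formulas.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def speech_to_identity(speech_emb, weight):
    """Hypothetical linear head: map a speech embedding to identity space.

    `weight` is a list of rows, one per output dimension.
    """
    projected = [sum(w * x for w, x in zip(row, speech_emb)) for row in weight]
    return l2_normalize(projected)


def alignment_loss(pred_id, arcface_id):
    """Cosine distance between the predicted and the ArcFace identity.

    Assumed form: 1 - cos(pred, target); 0 when aligned, 2 when opposite.
    """
    target = l2_normalize(arcface_id)
    return 1.0 - sum(p * t for p, t in zip(pred_id, target))


def self_consistency_loss(denoiser, clean_latent, noise, pred_id, alpha_bar):
    """Mean squared denoising residual, conditioned on the predicted identity.

    The latent is noised as sqrt(alpha_bar) * z + sqrt(1 - alpha_bar) * eps,
    and the denoiser is asked to recover eps. A large residual suggests the
    predicted identity lies off the manifold the denoiser was trained on.
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [a * z + b * e for z, e in zip(clean_latent, noise)]
    predicted_noise = denoiser(noisy, pred_id)
    residual = [(p - e) ** 2 for p, e in zip(predicted_noise, noise)]
    return sum(residual) / len(residual)
```

In a real training loop the gradient of the residual would flow only into the mapping head and the LoRA adapters while the diffusion backbone stays frozen, which is what makes the loss discriminator-free: the denoiser itself plays the role of the critic.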

Cite This Study

Ma et al. (2026) studied this question.

synapsesocial.com/papers/699405774e9c9e835dfd64d4 https://doi.org/10.3390/info17020200
