Speech-driven face generation aims to synthesize a face image that matches a speaker’s identity from speech alone. However, existing methods typically trade identity fidelity for visual quality and rely on large end-to-end generators that are difficult to train and tune. We propose Vox2Face, a speech-driven face generation framework centered on an explicit identity space rather than direct speech-to-image mapping. A pretrained speaker encoder first extracts speech embeddings, which are distilled and metric-aligned to the ArcFace hyperspherical identity space, transforming cross-modal regression into a geometrically interpretable speech-to-identity alignment problem. On this unified identity representation, we reuse an identity-conditioned diffusion model as the generative backbone and synthesize diverse, high-resolution faces in the Stable Diffusion latent space. To better exploit this prior, we introduce a discriminator-free diffusion self-consistency loss that treats denoising residuals as an implicit critique of speech-predicted identity embeddings and updates only the speech-to-identity mapping and lightweight LoRA adapters, encouraging speech-derived identities to lie on the high-probability identity manifold of the diffusion model. Experiments on the HQ-VoxCeleb dataset show that, relative to a strong diffusion baseline, Vox2Face improves the ArcFace cosine similarity from 0.295 to 0.322, raises R@10 retrieval accuracy from 29.8% to 32.1%, and raises the VGGFace Score from 18.82 to 23.21. These results indicate that aligning speech to a unified identity space and reusing a strong identity-conditioned diffusion prior together offer an effective way to jointly improve identity fidelity and visual quality.
Ma et al. studied this question.
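To make the geometry of speech-to-identity alignment concrete, the following is a minimal sketch of a cosine alignment objective on the unit hypersphere, together with a top-k retrieval check of the kind underlying an R@K metric. It is illustrative only: the abstract does not specify the exact form of the distillation or metric-alignment losses, so the `1 - cos` objective and the function names `l2_normalize`, `cosine_similarity`, `alignment_loss`, and `recall_at_k` are assumptions rather than the authors' implementation. The toy vectors stand in for a speech-predicted identity embedding and ArcFace face embeddings.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_hat, b_hat))


def alignment_loss(pred_identity, face_identity):
    """Assumed alignment objective: 1 - cos.

    It is 0 when the two directions coincide and 2 when they are opposite.
    """
    return 1.0 - cosine_similarity(pred_identity, face_identity)


def recall_at_k(query, gallery, target_index, k):
    """Return True if the target ranks in the top-k gallery entries by cosine."""
    scores = [cosine_similarity(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return target_index in ranked[:k]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; real ArcFace embeddings are much higher-dimensional.
    speech_identity = [1.0, 0.1]
    gallery = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
    print(alignment_loss(speech_identity, gallery[1]))
    print(recall_at_k(speech_identity, gallery, target_index=1, k=1))
```

Because both embeddings are normalized before comparison, the loss depends only on the angle between them, which is what makes the alignment problem geometrically interpretable in the sense used above.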