What question did this study set out to answer?

The aim is to enable photorealistic facial animation on low-end devices without relying on cloud computing or high-end GPUs.

April 11, 2026 | Open Access

LiveFace: Real-Time Photorealistic Facial Animation on Low-End Mobile Devices via Compact Per-Avatar Neural Decoders and Universal Compositor-Upscaler

Key Points

Developed a modular neural rendering system called LiveFace.
Implemented a decomposed per-avatar decoder architecture for facial regions.
Created a universal compositor-upscaler for final image processing.
Used a video-driven knowledge distillation pipeline to train the per-avatar decoders.
Achieved a real-time animation rate of 30 fps on low-end mobile hardware.
The full system comprises approximately 20 million INT8 parameters, with a total inference latency of about 19 ms per frame.
Targets photorealistic output, in contrast to existing on-device alternatives that trade realism for a stylized cartoon look.
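
The real-time claim can be sanity-checked with simple arithmetic. The sketch below uses only the figures quoted on this page (30 fps, about 19 ms per frame, about 20 million parameters); the one-byte-per-parameter figure is an assumption that follows from the INT8 quantization mentioned in the abstract, and the variable names are illustrative.

```python
# Back-of-envelope check of the real-time claim, using only the
# figures quoted on this page. Not taken from the paper's code.

TARGET_FPS = 30            # reported animation rate
LATENCY_MS = 19.0          # reported inference latency per frame (approximate)
PARAMS = 20_000_000        # reported total parameter count (approximate)
BYTES_PER_PARAM = 1        # assumption: INT8 weights stored as one byte each

frame_budget_ms = 1000.0 / TARGET_FPS        # time available for each frame
headroom_ms = frame_budget_ms - LATENCY_MS   # left for capture, display, and so on
max_fps = 1000.0 / LATENCY_MS                # ceiling if inference were the only cost
weights_mb = PARAMS * BYTES_PER_PARAM / 1e6  # rough size of the weights alone

print(f"frame budget at {TARGET_FPS} fps: {frame_budget_ms:.1f} ms")
print(f"headroom after inference: {headroom_ms:.1f} ms")
print(f"inference-only ceiling: {max_fps:.1f} fps")
print(f"approximate weight storage: {weights_mb:.0f} MB")
```

Under these assumptions, inference uses a little over half of the 33.3 ms available per frame at 30 fps, which is consistent with the reported real-time operation.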

Abstract

We present LiveFace, a modular neural rendering system that achieves photorealistic talking-head animation at 30 fps on low-end mobile devices with as little as ~10 GFLOPS of compute (e.g., Qualcomm Snapdragon 439). Prior photorealistic facial animation systems either require cloud infrastructure with 100M+ parameter models (HeyGen, D-ID, Synthesia) or demand desktop-class GPUs (MetaHuman, Audio2Face), while on-device alternatives sacrifice realism for stylized cartoon aesthetics (Apple Memoji, Samsung AR Emoji). LiveFace bridges this gap through three key contributions: (1) a decomposed per-avatar decoder architecture that factorizes the face into four independently rendered regions — mouth, eyes, hair, and body — each handled by a compact neural decoder augmented with a 128-dimensional learnable identity embedding; (2) a universal compositor-upscaler (~7M parameters) shared across all avatars that composites the decoded patches onto a 9:16 portrait canvas and upscales to display resolution in a single forward pass; and (3) a video-driven knowledge distillation pipeline that uses RAVDESS emotional speech videos as driving sources for LivePortrait to generate diverse, naturalistic training data for the student decoders. The full system comprises ~20M INT8 parameters with a total inference latency of ~19 ms per frame, enabling real-time, fully offline operation on commodity mobile hardware without any cloud dependency.
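
The data flow described in the abstract (four per-avatar region decoders feeding one shared compositor-upscaler) can be sketched as follows. This is a minimal structural illustration, not the authors' implementation: the class and function names, patch sizes, canvas layout, and scale factor are all hypothetical, the "decoders" are placeholders that return constant patches, and the compositor is replaced by a plain paste followed by nearest-neighbour upscaling rather than a neural network.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Patch = List[List[float]]                    # rows of pixel values (illustrative)
REGIONS = ("mouth", "eyes", "hair", "body")  # the four regions named in the abstract
EMBED_DIM = 128                              # identity embedding size from the abstract


@dataclass
class RegionDecoder:
    """Placeholder for one compact per-avatar decoder (hypothetical interface)."""
    region: str
    size: Tuple[int, int]  # (height, width) of the output patch, assumed

    def decode(self, driving: List[float], identity: List[float]) -> Patch:
        if len(identity) != EMBED_DIM:
            raise ValueError("identity embedding must have 128 dimensions")
        # Stand-in for a neural network: fill the patch with the input mean.
        value = (sum(driving) + sum(identity)) / (len(driving) + len(identity))
        height, width = self.size
        return [[value] * width for _ in range(height)]


def composite_and_upscale(patches: Dict[str, Patch],
                          layout: Dict[str, Tuple[int, int]],
                          canvas_hw: Tuple[int, int],
                          scale: int) -> Patch:
    """Stand-in for the shared compositor-upscaler: paste, then upscale."""
    height, width = canvas_hw
    canvas = [[0.0] * width for _ in range(height)]
    for region, patch in patches.items():
        top, left = layout[region]
        for r, row in enumerate(patch):
            canvas[top + r][left:left + len(row)] = row
    # Nearest-neighbour upscale: repeat each pixel and each row `scale` times.
    return [[px for px in row for _ in range(scale)]
            for row in canvas for _ in range(scale)]


def render_frame(decoders, driving, identity, layout, canvas_hw, scale):
    """Decode every region independently, then composite once."""
    patches = {d.region: d.decode(driving, identity) for d in decoders}
    return composite_and_upscale(patches, layout, canvas_hw, scale)


# Hypothetical sizes and positions on an 18 x 32 (9:16) canvas.
sizes = {"hair": (8, 18), "eyes": (4, 10), "mouth": (4, 6), "body": (12, 18)}
layout = {"hair": (0, 0), "eyes": (8, 4), "mouth": (14, 6), "body": (20, 0)}
decoders = [RegionDecoder(name, sizes[name]) for name in REGIONS]

identity = [0.5] * EMBED_DIM   # one learnable embedding per avatar in the real system
driving = [0.1, 0.2, 0.3]      # hypothetical driving signal for this frame

frame = render_frame(decoders, driving, identity, layout, (32, 18), scale=2)
print(len(frame[0]), "x", len(frame))  # output width x height
```

The point of the sketch is the split of responsibilities: only the small region decoders and the identity embedding are specific to an avatar, while the compositing and upscaling stage is shared across all avatars.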

Cite This Study

Rodin et al. studied this question.

synapsesocial.com/papers/69d9e67a78050d08c1b76dcc https://doi.org/10.5281/zenodo.19477081