What question did this study set out to answer?

The aim is to develop a framework for real-time reconstruction of whole-body avatars from monocular videos.

April 3, 2026 · Open Access

H2Avatar: Expressive Whole-Body Avatars from Monocular Video via Hierarchical Geometry and Hybrid Rendering

Key Points

Aims to reconstruct expressive whole-body avatars from monocular video in real time.
Introduces a mesh-embedded 3D Gaussian representation guided by SMPL-X.
Encodes geometry hierarchically with a multi-scale tri-plane pyramid that captures both global structure and high-frequency surface detail such as clothing wrinkles.
Renders appearance with a hybrid strategy: a learnable UV texture map anchors canonical colors, and a neural residual color branch models pose- and view-dependent shading.
Improves photorealism and temporal stability over representative baselines on the NeuMan dataset.
Outperforms ExAvatar by up to 0.66 dB in PSNR.
Reduces LPIPS by up to 16.3% relative to ExAvatar, indicating closer perceptual similarity to the ground-truth frames.
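
To put the two reported numbers in context, the sketch below shows the standard PSNR definition and what a dB gain or a relative LPIPS reduction means arithmetically. The function names are illustrative, and the LPIPS values in the usage lines are hypothetical placeholders chosen only so the arithmetic yields 16.3%; they are not figures from the paper.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio new_mse / old_mse implied by a PSNR gain of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)


def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric such as LPIPS."""
    return (baseline - ours) / baseline


# A 0.66 dB PSNR gain corresponds to roughly 14% lower MSE (ratio of about 0.859).
ratio = mse_ratio_for_gain(0.66)

# Hypothetical LPIPS values (not from the paper): 1.000 -> 0.837 is a 16.3% reduction.
reduction = relative_reduction(1.000, 0.837)
```

Because PSNR is logarithmic, a gain that looks small in dB can still reflect a noticeable drop in pixel error, while LPIPS is reported here as a relative change in a learned perceptual distance.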

Abstract

Reconstructing photorealistic and animatable whole-body avatars from monocular videos is an active research topic in computer vision and computer graphics. However, existing methods still face challenges due to the limited frequency response of single-scale geometry encodings and the instability of appearance modeling without an explicit surface anchor. In this paper, we present H2Avatar, a real-time framework that builds on a mesh-embedded 3D Gaussian representation guided by SMPL-X and disentangles geometry and appearance into hierarchical and hybrid components. For geometry, we propose a semantic-aware hierarchical encoding based on a multi-scale tri-plane pyramid, where features at different resolutions capture both global structure and high-frequency surface details such as clothing wrinkles. For appearance, we introduce a hybrid rendering strategy that anchors canonical colors using a learnable UV texture map and complements it with a neural residual color branch conditioned on tri-plane features, pose embedding, and surface normals to model pose- and view-dependent shading variations. This design improves temporal stability and preserves identity details while enhancing photorealism under complex motions. Experiments on the NeuMan dataset demonstrate that H2Avatar consistently outperforms representative baselines across multiple sequences, exceeding ExAvatar by up to 0.66 dB in PSNR and reducing LPIPS by up to 16.3%. These results validate the effectiveness of hierarchical geometry encoding and texture-anchored hybrid appearance modeling.
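
The two ideas in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the plane resolutions, the channel count, summing the three planes within a level, concatenating levels, the single linear layer standing in for the residual network, and the additive combination with the texture color are all assumptions made here for clarity. All function names are hypothetical.

```python
import numpy as np


def bilinear_sample(plane, uv):
    """Bilinearly sample plane (R, R, C) at uv (N, 2) in [0, 1]; returns (N, C)."""
    res = plane.shape[0]
    xy = np.clip(uv, 0.0, 1.0) * (res - 1)
    i0 = np.minimum(np.floor(xy).astype(int), res - 2)  # lower cell corner
    f = xy - i0                                         # fractional offsets
    x0, y0 = i0[:, 0], i0[:, 1]
    fx, fy = f[:, 0:1], f[:, 1:2]
    low = plane[x0, y0] * (1 - fy) + plane[x0, y0 + 1] * fy
    high = plane[x0 + 1, y0] * (1 - fy) + plane[x0 + 1, y0 + 1] * fy
    return low * (1 - fx) + high * fx


def triplane_pyramid_features(pyramid, points):
    """Query every pyramid level at points (N, 3) in [0, 1]^3.

    Each level holds three planes (XY, XZ, YZ). Planes are summed within a
    level and levels are concatenated (both choices are assumptions).
    """
    per_level = []
    for xy_plane, xz_plane, yz_plane in pyramid:
        per_level.append(
            bilinear_sample(xy_plane, points[:, [0, 1]])
            + bilinear_sample(xz_plane, points[:, [0, 2]])
            + bilinear_sample(yz_plane, points[:, [1, 2]])
        )
    return np.concatenate(per_level, axis=1)


def residual_color(feats, pose_embedding, normals, weights, scale=0.1):
    """Stand-in for the residual branch: one linear layer with a bounded output."""
    n = feats.shape[0]
    pose = np.broadcast_to(pose_embedding, (n, pose_embedding.shape[0]))
    x = np.concatenate([feats, pose, normals], axis=1)
    return scale * np.tanh(x @ weights)


def hybrid_color(uv_texture, uv, residual):
    """Texture-anchored base color plus a residual (additive form assumed)."""
    base = bilinear_sample(uv_texture, uv)
    return np.clip(base + residual, 0.0, 1.0)


# Toy usage with random data: 3 pyramid levels, 4 channels per level.
rng = np.random.default_rng(0)
channels = 4
pyramid = [
    tuple(rng.normal(size=(r, r, channels)) for _ in range(3))
    for r in (8, 16, 32)
]
points = rng.uniform(size=(5, 3))        # canonical positions of 5 Gaussians
uv = rng.uniform(size=(5, 2))            # their UV coordinates on the mesh
normals = rng.normal(size=(5, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
pose_embedding = rng.normal(size=(6,))   # hypothetical 6-D pose code
uv_texture = rng.uniform(size=(64, 64, 3))

feats = triplane_pyramid_features(pyramid, points)
weights = rng.normal(size=(feats.shape[1] + 6 + 3, 3))
colors = hybrid_color(
    uv_texture, uv, residual_color(feats, pose_embedding, normals, weights)
)
```

In this sketch the coarse planes contribute smooth, low-frequency features and the fine planes add detail. The texture lookup gives each point a pose-independent base color, so the residual branch only has to account for small shading changes.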
