Reconstructing photorealistic, animatable whole-body avatars from monocular videos is an active research topic in computer vision and computer graphics. However, existing methods remain limited by the restricted frequency response of single-scale geometry encodings and by unstable appearance modeling when no explicit surface anchor is available. In this paper, we present H2Avatar, a real-time framework that builds on a mesh-embedded 3D Gaussian representation guided by SMPL-X and disentangles geometry and appearance into hierarchical and hybrid components. For geometry, we propose a semantic-aware hierarchical encoding based on a multi-scale tri-plane pyramid, in which features at different resolutions capture both global structure and high-frequency surface details such as clothing wrinkles. For appearance, we introduce a hybrid rendering strategy that anchors canonical colors with a learnable UV texture map and complements it with a neural residual color branch, conditioned on tri-plane features, a pose embedding, and surface normals, to model pose- and view-dependent shading variations. This design improves temporal stability and preserves identity details while enhancing photorealism under complex motions. Experiments on the NeuMan dataset show that H2Avatar consistently outperforms representative baselines across multiple sequences, improving on ExAvatar by up to 0.66 dB in PSNR and reducing LPIPS by up to 16.3%. These results validate the effectiveness of hierarchical geometry encoding and texture-anchored hybrid appearance modeling.
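To make the two components described above more concrete, the following is a minimal sketch in plain Python, not the H2Avatar implementation. It assumes that each pyramid level stores three axis-aligned feature planes, that the three plane features are summed within a level and concatenated across levels, and that the final color is the UV texture lookup plus an additive residual clamped to [0, 1]. None of these choices is confirmed by the text. All names (`bilerp`, `sample_triplane_pyramid`, `hybrid_color`, `constant_plane`, `toy_residual`) are hypothetical, and `residual_fn` stands in for the neural residual branch.

```python
def bilerp(plane, u, v):
    """Bilinearly sample a square grid of feature vectors at (u, v) in [0, 1]."""
    res = len(plane)
    x = min(max(u, 0.0), 1.0) * (res - 1)
    y = min(max(v, 0.0), 1.0) * (res - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - fx) + plane[y0][x1][c] * fx
        bottom = plane[y1][x0][c] * (1 - fx) + plane[y1][x1][c] * fx
        out.append(top * (1 - fy) + bottom * fy)
    return out


def sample_triplane_pyramid(pyramid, point):
    """Query every level of a tri-plane pyramid at a 3D point in [0, 1]^3.

    pyramid: list of levels, coarse to fine; each level is a dict with
    "xy", "xz", "yz" planes. Assumption: plane features are summed per
    level and the per-level results are concatenated.
    """
    x, y, z = point
    features = []
    for level in pyramid:
        f_xy = bilerp(level["xy"], x, y)
        f_xz = bilerp(level["xz"], x, z)
        f_yz = bilerp(level["yz"], y, z)
        features.extend(a + b + c for a, b, c in zip(f_xy, f_xz, f_yz))
    return features


def hybrid_color(texture, uv, features, pose, normal, residual_fn):
    """Texture-anchored color plus a residual (assumed additive), clamped to [0, 1]."""
    base = bilerp(texture, uv[0], uv[1])           # canonical color from the UV map
    residual = residual_fn(features, pose, normal)  # stand-in for the neural branch
    return [min(max(b + r, 0.0), 1.0) for b, r in zip(base, residual)]


def constant_plane(res, vector):
    """Helper for the toy example: a res x res plane filled with one vector."""
    return [[list(vector) for _ in range(res)] for _ in range(res)]


# Toy usage: a two-level pyramid (2x2 and 4x4 planes) and an 8x8 RGB texture.
pyramid = [
    {k: constant_plane(2, [1.0, 1.0]) for k in ("xy", "xz", "yz")},
    {k: constant_plane(4, [0.5, 0.5]) for k in ("xy", "xz", "yz")},
]
texture = constant_plane(8, [0.5, 0.4, 0.3])


def toy_residual(feats, pose, normal):
    # Hypothetical shading term: brighten surfaces whose normal faces +z.
    return [0.1 * normal[2]] * 3


feats = sample_triplane_pyramid(pyramid, (0.3, 0.6, 0.9))
color = hybrid_color(texture, (0.25, 0.75), feats, pose=[0.0],
                     normal=(0.0, 0.0, 1.0), residual_fn=toy_residual)
```

In this sketch the texture lookup depends only on the surface coordinate, so it is unaffected by pose, while all pose- and view-dependent variation is confined to the residual term. This is one plausible reading of how a texture anchor could stabilize appearance over time.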
Zhang et al. (Wed,) studied this question.