Monocular dynamic video reconstruction is an ill-posed problem because observations are limited and 3D motions are complex. Despite recent advances in dynamic 3D Gaussian splatting, most methods still struggle in the monocular setting, since they either rely heavily on geometric cues from multiple cameras or ignore the structural coherence among the optimized 3D Gaussians. To address this, we propose Hie4DGS, a novel hierarchical structure representation for modeling complex dynamic motions from monocular dynamic videos. Specifically, we decompose the motions of a dynamic scene into groups at multiple structural granularities and progressively compose them to derive the motion of each 3D Gaussian. Building on this representation, we leverage hierarchical semantic segmentation to group Gaussians and initialize their motion within each group using depth and tracking priors. Additionally, we introduce a structure rendering loss that enforces consistency between the learned motion structure and semantic priors, further reducing motion ambiguity. Compared to state-of-the-art dynamic Gaussian methods, we achieve significant improvement in rendering quality on monocular video datasets featuring complex real-world motions.
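To make the idea of composing group motions across granularities concrete, the following is a minimal sketch rather than the Hie4DGS formulation itself. It assumes that each level assigns a Gaussian to one group, that each group carries a rigid transform (rotation plus translation) at a given time step, and that transforms are applied from coarse to fine. The names `rot_z`, `apply_rigid`, `compose_motion`, and `level_transforms` are hypothetical and do not come from the paper.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z-axis, as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(rotation, translation, point):
    """Apply x -> R x + t to a 3D point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def compose_motion(point, group_ids, level_transforms):
    """Move one Gaussian center through every level of the hierarchy.

    group_ids[l] is the (assumed) group index of this Gaussian at level l,
    ordered coarse to fine; level_transforms[l][g] is the (rotation,
    translation) pair of group g at level l for the current time step.
    """
    for level, gid in enumerate(group_ids):
        rotation, translation = level_transforms[level][gid]
        point = apply_rigid(rotation, translation, point)
    return point

# Two levels: level 0 is a single object-level group that translates;
# level 1 splits the object into two parts, one of which also rotates.
level_transforms = [
    {0: (rot_z(0.0), (1.0, 0.0, 0.0))},
    {0: (rot_z(0.0), (0.0, 0.0, 0.0)),
     1: (rot_z(math.pi / 2), (0.0, 0.0, 0.0))},
]

moved_a = compose_motion((1.0, 0.0, 0.0), [0, 0], level_transforms)
moved_b = compose_motion((1.0, 0.0, 0.0), [0, 1], level_transforms)
```

In this toy setup both Gaussians share the coarse object-level translation, while only the second receives the extra part-level rotation. Shared coarse motion with group-specific refinement is the kind of structural coherence the abstract describes, though the actual parameterization in the paper may differ.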
Kai Cheng
Kaizhi Yang
Xiaoxiao Long
Cheng et al. studied this question.
www.synapsesocial.com/papers/697460cebb9d90c67120aaa4 — DOI: https://doi.org/10.1109/tvcg.2026.3656737