What question did this study set out to answer?

The study aims to develop a framework for 3D facial tracking from monocular video that combines emotion-driven modeling with facial semantic alignment.

March 6, 2026 · Open Access

HMamba-3DFT: A Hierarchical Mamba Framework for Emotion-driven Semantic 3D Facial Tracking

Key Points

The aim is to develop a framework for effective 3D facial tracking that incorporates emotion and semantic analysis.
Proposes HMamba-3DFT, a hierarchical Mamba framework tailored for 3D facial tracking.
Introduces the BSTV-Mamba module, whose BSTS-Scan mechanism captures spatiotemporal facial dynamics.
Develops a dual optimization strategy that integrates emotion-driven modeling with semantic alignment.
Takes monocular video as input.
Tracks variations in 3D facial shape across video frames efficiently.
Shows competitive performance against state-of-the-art methods on benchmark datasets.
Uses the dual optimization to guide training toward more accurate reconstructed 3D facial meshes.

Abstract

• First Mamba-based framework, HMamba-3DFT, tailored for 3D facial tracking
• BSTV-Mamba with BSTS-Scan captures spatiotemporal facial dynamics
• Dual optimization integrates dynamic emotion-driven modeling with semantic alignment

Monocular video-based 3D face tracking is vital for interactive pattern recognition and human avatars. Most existing image-based methods fail to model temporal dependencies in video, causing jitter and inaccuracies. They also often neglect the continuous multi-modal signals present in facial videos, such as expression dynamics and emotional cues, that provide essential temporal drivers for facial modeling. To this end, this study is the first to explore the Mamba architecture for 3D facial tracking, proposing a hierarchical Mamba framework termed HMamba-3DFT. The proposed network can efficiently capture and track variations in 3D facial shapes from a monocular video. To exploit the global spatiotemporal correlations across frames of the dynamic face, we develop a bidirectional spatiotemporal vision Mamba (BSTV-Mamba) module featuring a bidirectional spatiotemporal selective scan (BSTS-Scan) mechanism. To capture temporally evolving multi-modal emotion signals embedded in continuous video sequences, we introduce a dynamic emotion-driven mechanism. Additionally, to mitigate the potential degradation of reconstruction fidelity caused by over-reliance on emotion-driven cues, we integrate facial semantic alignment with facial emotion driving to improve the accuracy of emotion-driven facial modeling. This dual-optimization strategy guides the network during training, so that the reconstructed 3D facial mesh both captures the emotional attributes of the input frames and is reconstructed more precisely. Extensive evaluations on benchmark datasets show competitive performance against state-of-the-art methods.
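To make the idea of a bidirectional scan over spatiotemporal tokens concrete, the following is a minimal NumPy sketch. It is not the BSTS-Scan mechanism from the paper, whose details the abstract does not give. It assumes a frame-major flattening of the (frames, spatial tokens, channels) features, a toy input-dependent gate standing in for Mamba's selective parameters, and a simple sum to fuse the two directions. The names `selective_scan`, `bidirectional_spatiotemporal_scan`, and `w_gate` are hypothetical.

```python
import numpy as np


def selective_scan(x, w_gate):
    """Toy input-dependent (selective) linear recurrence over a sequence.

    x: (L, D) token features; w_gate: (D,) hypothetical gate weights.
    Recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t,
    where a_t = sigmoid(x_t . w_gate) depends on the current token.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        a = 1.0 / (1.0 + np.exp(-(x[t] @ w_gate)))  # scalar gate in (0, 1)
        h = a * h + (1.0 - a) * x[t]
        out[t] = h
    return out


def bidirectional_spatiotemporal_scan(feats, w_gate):
    """Scan a (T, N, D) feature volume forward and backward, then sum.

    T frames, N spatial tokens per frame, D channels. The frame-major
    flattening and the additive fusion are assumptions of this sketch.
    """
    T, N, D = feats.shape
    seq = feats.reshape(T * N, D)                   # one long token sequence
    fwd = selective_scan(seq, w_gate)               # past -> future context
    bwd = selective_scan(seq[::-1], w_gate)[::-1]   # future -> past context
    return (fwd + bwd).reshape(T, N, D)


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 8))   # 4 frames, 6 tokens, 8 channels
w_gate = rng.standard_normal(8)
out = bidirectional_spatiotemporal_scan(feats, w_gate)
```

Because each output token sums a forward and a backward pass, it carries context from both earlier and later frames, which is the general motivation for bidirectional scanning in video models. The per-token Python loop is for readability only; practical implementations use a parallel scan.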

Authors

Haodong Jin

Muwei Jian

Linyi University

Derui Ding

Journals

Pattern Recognition

Institutions

University of Glasgow

University of Shanghai for Science and Technology

Shandong University of Finance and Economics
