Recent advances in 3D human face reconstruction have markedly improved rendering quality and realism. However, most existing methods depend critically on precise prior knowledge, such as camera intrinsic and extrinsic parameters and detailed facial expression annotations, which are costly and impractical to acquire in unconstrained environments. This dependence severely limits their applicability in real-world scenarios. To address this challenge, we present FaceNeRF--, a novel framework for dynamic 3D facial reconstruction and manipulation that requires only sequential facial landmarks as input. By integrating lightweight landmark observations into implicit neural representations, FaceNeRF-- simultaneously estimates head pose and synthesizes photorealistic face images, eliminating the reliance on camera calibration and predefined expression models. Our approach introduces two key innovations. First, we propose hypothetical projecting rays (HPRs), which derive ray directions directly from predicted head poses and thereby allow accurate volumetric rendering without known camera parameters. Second, we develop a masked hierarchical sampling (Masked-HS) strategy that disentangles head pose from facial expressions, allowing the model to avoid overfitting to landmark inputs and to learn a more robust representation of dynamic facial geometry. Together, these techniques form a unified pipeline that supports self-supervised training, efficient inference, and explicit editing of facial expressions and head orientations. Extensive experiments on diverse in-the-wild datasets demonstrate that FaceNeRF-- achieves high-quality dynamic face reconstruction and accurate head pose prediction. In addition, our method supports practical downstream applications, including real-time reenactment, pose manipulation, and expression editing, highlighting its versatility and scalability.
Overall, FaceNeRF-- provides a lightweight yet powerful solution for dynamic 3D face modeling, significantly lowering the requirements for data acquisition while maintaining photorealistic synthesis performance.
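To make the idea of deriving rays from a head pose concrete, the following is a minimal sketch and not the HPR formulation itself, which the abstract does not specify. It assumes a canonical pinhole camera at the origin with a guessed focal length, and a predicted head pose (R, t) that maps head-frame points to camera-frame points as x_cam = R x_head + t. Under those assumptions, each pixel ray is expressed in the head frame by applying the inverse pose. All function names and parameters here (`rotation_from_euler`, `rays_from_head_pose`, `focal`) are hypothetical and chosen for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(yaw, pitch, roll):
    """Build R = Rz(roll) @ Ry(yaw) @ Rx(pitch); angles in radians.

    The Euler convention is an assumption made for this sketch.
    """
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))


def transpose_apply(r, v):
    """Return R^T v, i.e. apply the inverse of rotation R to vector v."""
    return [sum(r[k][i] * v[k] for k in range(3)) for i in range(3)]


def rays_from_head_pose(r, t, height, width, focal):
    """Express per-pixel camera rays in the head's canonical frame.

    Assumes a pinhole camera at the origin looking along +z, with the
    principal point at the image center and an assumed focal length
    (in pixels). Returns (origin, directions), where directions[v][u]
    is a unit vector for the pixel in row v, column u.
    """
    # Camera center (0, 0, 0) mapped into the head frame: -R^T t.
    origin = [-c for c in transpose_apply(r, t)]
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    directions = []
    for v in range(height):
        row = []
        for u in range(width):
            d = [(u - cx) / focal, (v - cy) / focal, 1.0]
            norm = math.sqrt(sum(c * c for c in d))
            d = [c / norm for c in d]
            # Rotate the camera-frame direction into the head frame.
            row.append(transpose_apply(r, d))
        directions.append(row)
    return origin, directions


# Usage: a frontal head placed two units in front of the assumed camera.
R = rotation_from_euler(0.0, 0.0, 0.0)
origin, dirs = rays_from_head_pose(R, [0.0, 0.0, 2.0], 3, 3, 1.0)
```

In this sketch the camera is held fixed and only the head pose varies, so changing the predicted pose changes every ray's origin and direction in the head frame. This is one simple way a renderer could operate without calibrated extrinsics; the focal length remains a free assumption.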
Huang et al. studied this question.