This dissertation comprises four papers that collectively advance cost-efficient, monitor-based telepresence by addressing four key areas: (1) view synthesis, (2) 3D video streaming, (3) single-camera portrait reconstruction, and (4) single-image Human Mesh Recovery. The paper on view synthesis presents a four-camera capture system designed for desktop telepresence, together with an accompanying view synthesis algorithm. I present a feed-forward neural network that performs volume rendering using learned features and an efficient 3D point sampling strategy tailored for RGB-D cameras. The result is higher view synthesis quality at high speed, robustness to challenging gestures and body movements, and the ability to generalize to different users without per-user training. The paper on 3D video streaming presents an algorithm that enables the transmission of 3D videos as neural radiance fields (NeRFs) at consumer-grade bandwidth. I show that certain layers of the NeRF networks can be held constant throughout a video stream, which reduces the amount of data transmitted to the client. The result is an algorithm that generates 3D video frames in less time and transmits them at a fraction of the original bandwidth with minimal loss of quality. The paper on single-camera 3D portrait reconstruction presents an algorithm that reconstructs dynamic 3D heads from a monocular video stream. I show that per-frame 3D reconstructions suffer from distortions and from artifacts in occluded regions, and that both can be reduced through undistortion and fusion with the help of an additional frontal image of the user. The result is improved temporal consistency of the reconstruction while preserving dynamic details such as lighting and challenging expressions.
The paper on single-image Human Mesh Recovery (HMR) introduces a method that performs well in both far-range and close-range use cases, the latter including telepresence scenarios. I show that existing methods fail at close range because of inaccurate camera parameters, and that accurate depth estimation makes it possible to solve accurately for both the camera and human body parameters from a single image. The result is improved 3D mesh accuracy and better 2D alignment with the input image. Each of these papers enhances a different facet of telepresence across varied system configurations. Collectively, they contribute to building high-quality, cost-efficient telepresence systems that fit within typical consumer budgets for hardware, compute, and bandwidth.
Shengze Wang studied this question.