What question did this study set out to answer?

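This dissertation asks how to build cost-efficient, monitor-based telepresence, and addresses the question through advances in view synthesis, 3D video streaming, single-camera portrait reconstruction, and single-image human mesh recovery.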

May 31, 2026 | Open Access

Towards AI-Assisted Cost-Efficient Monitor-Based Telepresence

Key Points

Developed a 4-camera system for desktop telepresence combined with a view synthesis algorithm.
Designed a 3D video streaming algorithm for transmitting neural radiance fields efficiently.
Created an algorithm for single-camera 3D portrait reconstruction from monocular video streams.
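Achieved faster, higher-quality view synthesis that is robust to challenging gestures and body movements and generalizes to different users without per-user training.
Streamed 3D video at a fraction of the original bandwidth, with minimal quality loss, by holding certain NeRF layers constant across a video stream.
Improved 3D mesh accuracy and 2D alignment in single-image human mesh recovery by using accurate depth estimation to solve for both camera and body parameters.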

Abstract

This dissertation comprises four papers that collectively advance cost-efficient, monitor-based telepresence by tackling four key areas: (1) view synthesis, (2) 3D video streaming, (3) single-camera portrait reconstruction, and (4) single-image human mesh recovery (HMR).

The paper on view synthesis presents a 4-camera capture system designed for desktop telepresence, together with an accompanying view synthesis algorithm. I present a feed-forward neural network that performs volume rendering using learned features and an efficient 3D point sampling strategy tailored for RGB-D cameras. The result is faster, higher-quality view synthesis that is robust to challenging gestures and body movements and generalizes to different users without per-user training.

The paper on 3D video streaming presents an algorithm that enables the transmission of 3D videos as neural radiance fields (NeRFs) at consumer-grade bandwidth. I discuss my finding that certain layers of the NeRF neural networks can be held constant throughout a video stream, which reduces the amount of data transmitted to the client. The result is an algorithm that generates 3D video frames in less time and transmits them at a fraction of the original bandwidth while incurring minimal loss of quality.

The paper on single-camera 3D portrait reconstruction presents an algorithm that reconstructs dynamic 3D heads from a monocular video stream. I show that per-frame 3D reconstructions suffer from distortions and from artifacts in occluded regions, and that both can be reduced through undistortion and fusion with the help of an additional frontal image of the user. The result is improved temporal consistency of the reconstruction while preserving dynamic details such as lighting and challenging expressions.
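As a rough illustration of the streaming idea described above (holding some NeRF layers constant for a whole stream so they are sent only once), the following sketch counts the bytes transmitted with and without layer sharing. The layer names, layer sizes, the choice of which layers are shared, and the float32 encoding are all assumptions made for illustration; the abstract does not specify them, and this is not the dissertation's algorithm.

```python
# Illustrative only: layer names, sizes, and the choice of shared layers
# are hypothetical and not taken from the dissertation.
from dataclasses import dataclass

BYTES_PER_PARAM = 4  # assumption: weights are sent as float32


@dataclass(frozen=True)
class Layer:
    name: str
    n_in: int
    n_out: int
    shared: bool  # True if the layer is held constant for the whole stream

    @property
    def n_params(self) -> int:
        # Weight matrix plus bias vector of a fully connected layer.
        return self.n_in * self.n_out + self.n_out


def stream_bytes(layers, n_frames, share_layers):
    """Total bytes sent for a stream of n_frames frames.

    With share_layers=True, layers marked as shared are sent once and all
    other layers are sent once per frame. With share_layers=False, every
    layer is resent for every frame (the baseline).
    """
    sent_once = 0
    sent_per_frame = 0
    for layer in layers:
        size = layer.n_params * BYTES_PER_PARAM
        if share_layers and layer.shared:
            sent_once += size
        else:
            sent_per_frame += size
    return sent_once + sent_per_frame * n_frames


# Hypothetical 4-layer MLP in which the two hidden layers are assumed shareable.
mlp = [
    Layer("input", 63, 256, shared=False),
    Layer("hidden_1", 256, 256, shared=True),
    Layer("hidden_2", 256, 256, shared=True),
    Layer("output", 256, 4, shared=False),
]

baseline = stream_bytes(mlp, n_frames=300, share_layers=False)
shared = stream_bytes(mlp, n_frames=300, share_layers=True)
ratio = shared / baseline  # fraction of the baseline bandwidth actually sent
```

In this toy setup the saving grows with the length of the stream, because the one-time cost of the shared layers is spread over more frames. How much quality is lost by freezing layers is an empirical question that the sketch does not address.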
The paper on single-image HMR introduces a method that performs well at both far and close range, including close-range telepresence scenarios. I show that existing methods fail at close range because their camera parameters are inaccurate, and that accurate depth estimation makes it possible to solve for both the camera and the human body parameters from a single image. The result is improved 3D mesh accuracy and better 2D alignment to the input image.

Each of these papers enhances a different facet of telepresence systems across varied system configurations. Collectively, they contribute to building high-quality, cost-efficient telepresence systems that fit within typical consumer budgets for hardware, compute, and bandwidth.
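To give some intuition for why camera parameters matter more at close range, the sketch below compares a pinhole (perspective) projection with a weak-perspective approximation, which scales every point as if it were at the subject's mean depth. Treating weak perspective as the source of inaccuracy is an assumption made here for illustration; the abstract does not say which camera model existing methods use. The body points, focal length, and distances are likewise hypothetical.

```python
# Illustrative only: the camera model comparison, point layout, and numbers
# are assumptions, not the dissertation's method or data.

def perspective_project(point, focal):
    """Pinhole projection of a camera-space point (x, y, z) to image coordinates."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def weak_perspective_project(point, focal, mean_depth):
    """Weak-perspective projection: every point is scaled as if at mean_depth."""
    x, y, _ = point
    return (focal * x / mean_depth, focal * y / mean_depth)


def max_projection_error(points, focal):
    """Largest pixel distance between the two projections over all points."""
    mean_depth = sum(p[2] for p in points) / len(points)
    worst = 0.0
    for p in points:
        u1, v1 = perspective_project(p, focal)
        u2, v2 = weak_perspective_project(p, focal, mean_depth)
        worst = max(worst, ((u1 - u2) ** 2 + (v1 - v2) ** 2) ** 0.5)
    return worst


def body_points(distance):
    """Two hypothetical body points, both 0.4 m off the optical axis:
    a hand reaching 0.3 m toward the camera, and a shoulder at `distance` (m)."""
    return [(0.4, 0.0, distance - 0.3), (0.4, 0.0, distance)]


FOCAL_PX = 1000.0  # assumed focal length in pixels

error_far = max_projection_error(body_points(5.0), FOCAL_PX)    # subject 5 m away
error_close = max_projection_error(body_points(0.8), FOCAL_PX)  # subject 0.8 m away
```

Under these assumptions the approximation error is small when the subject is several meters away and becomes large at desktop distances, where the depth variation within the body is no longer negligible relative to the distance to the camera.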
