May 13, 2024

A stereo vision-based real-time 3D hand pose estimation system combining nonlinear optimization

Abstract

Gesture interaction is the most primitive and natural way for humans to interact and plays a crucial role in virtual reality and augmented reality technologies. It lets users control virtual environments, for example by selecting, moving, and rotating virtual objects. Although various 2D pose estimation methods based on convolutional neural networks (CNNs) can track and label poses in 2D video, real-world gesture interaction takes place in 3D space. Common 3D pose estimation methods rely on supervised learning and yield accurate results, but obtaining the 3D data they need through camera calibration and annotation is costly. Moreover, the limited computing power of mobile devices hinders the deployment of advanced algorithms, posing challenges for industrial applications. To address the difficulty of acquiring annotated 3D data and the limitations of mobile hardware, this paper proposes a lightweight approach that combines hand biomechanics and nonlinear optimization, enabling 3D pose estimation with binocular cameras without requiring extensive 3D data labeling for training. A lightweight CNN-based model detects and tracks hand keypoints in the images from both cameras, and the reprojection error is then computed. The reprojection error serves as the optimization objective for 3D pose estimation: minimizing it yields more accurate 3D coordinates in the camera frame. Constraints on palm size and joint lengths are applied to prevent unrealistic hand poses. Finally, the Levenberg-Marquardt algorithm is used for nonlinear optimization to obtain the optimal 3D hand pose estimate. We conducted experiments on a test gesture dataset and compared our method with MediaPipe, demonstrating advantages in both accuracy and real-time performance. Furthermore, we deployed the system on augmented reality glasses powered by the RK3588 SoC and, with NPU acceleration, achieved a frame rate of 50 FPS.
The proposed 2.5D pose estimation model based on binocular cameras and nonlinear optimization leverages information from multiple viewpoints, resulting in more accurate 3D pose estimation suitable for virtual reality and augmented reality applications. It handles noise, mismatches, and hand occlusions, exhibiting superior robustness in complex scenarios.
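
The optimization step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a rectified stereo rig with made-up intrinsics (`F`, `CX`, `CY`, `BASELINE`), only two hand joints, a soft penalty on one bone length in place of the paper's palm-size and joint-length constraints, and a hand-written Levenberg-Marquardt loop with a forward-difference Jacobian. All function names and the `weight` parameter are hypothetical.

```python
import numpy as np

# Hypothetical rectified stereo rig: focal length (px), principal point (px), baseline (m).
F, CX, CY, BASELINE = 500.0, 320.0, 240.0, 0.06

def project(points, cam_x):
    """Pinhole projection of (N, 3) points into a camera shifted by cam_x along x."""
    u = F * (points[:, 0] - cam_x) / points[:, 2] + CX
    v = F * points[:, 1] / points[:, 2] + CY
    return np.stack([u, v], axis=1)

def residuals(params, obs_left, obs_right, bone_len, weight):
    """Reprojection errors in both views plus a weighted bone-length penalty."""
    pts = params.reshape(-1, 3)
    r_left = (project(pts, 0.0) - obs_left).ravel()
    r_right = (project(pts, BASELINE) - obs_right).ravel()
    bones = np.linalg.norm(pts[1:] - pts[:-1], axis=1)
    return np.concatenate([r_left, r_right, weight * (bones - bone_len)])

def numeric_jacobian(fun, p, eps=1e-6):
    """Forward-difference Jacobian of fun at p."""
    r0 = fun(p)
    jac = np.empty((r0.size, p.size))
    for i in range(p.size):
        shifted = p.copy()
        shifted[i] += eps
        jac[:, i] = (fun(shifted) - r0) / eps
    return jac

def levenberg_marquardt(fun, p0, iters=50, lam=1e-3):
    """Minimal LM loop: damped normal equations with accept/reject of each step."""
    p = np.asarray(p0, dtype=float).copy()
    cost = np.sum(fun(p) ** 2)
    for _ in range(iters):
        r = fun(p)
        jac = numeric_jacobian(fun, p)
        a = jac.T @ jac
        step = np.linalg.solve(a + lam * np.diag(np.diag(a)), -jac.T @ r)
        new_cost = np.sum(fun(p + step) ** 2)
        if new_cost < cost:      # step helped: accept it and relax the damping
            p, cost, lam = p + step, new_cost, lam * 0.5
        else:                    # step hurt: keep p and increase the damping
            lam *= 10.0
    return p

# Synthetic, noise-free example: two joints about 40 cm in front of the cameras.
truth = np.array([[0.02, -0.01, 0.40], [0.03, 0.06, 0.42]])
obs_l, obs_r = project(truth, 0.0), project(truth, BASELINE)
bone_len = np.linalg.norm(truth[1] - truth[0])
p0 = (truth + np.array([0.01, -0.01, 0.05])).ravel()   # perturbed initial guess

estimate = levenberg_marquardt(
    lambda p: residuals(p, obs_l, obs_r, bone_len, weight=1000.0), p0)
print(np.round(estimate.reshape(-1, 3), 4))
```

In this sketch the bone-length term is a soft constraint: `weight` trades pixels of reprojection error against metres of bone-length deviation, so with noisy keypoints it pulls the solution toward anatomically plausible poses rather than enforcing them exactly. A production system would more likely use an analytic Jacobian and an established solver.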
