What question did this study set out to answer?

This research aims to improve the reliability of pose estimates in CNN-based object recognition.

April 29, 2026 · Open Access

Uncertainty driven pose estimation - for rigid objects in CNN-based pipelines

Key Points

Proposes a fusion-based CNN approach for pose estimation.
Self-estimates uncertainty for each 6D pose using a novel output architecture.
Integrates additional information without affecting CNN performance.
Demonstrates competitive pose performance while providing meaningful uncertainty estimates.
Shows that auxiliary information significantly enhances pose estimation accuracy.

Abstract

Nowadays, image-based object recognition and pose estimation are highly active research areas due to their importance in robotic perception and interaction. While modern CNN-based pose estimators achieve great results, they lack transparency regarding the trustworthiness and precision of individual estimates. This lack of certainty inhibits further processing of the results and deters deployment in production environments due to reliability concerns. As an answer, this thesis proposes a fusion-based approach in which, due to a novel output architecture, the CNN self-estimates the amount of information obtained, resulting in individual 6D uncertainty estimates per 6D pose estimate. Specifically, the CNN predicts the observed object points pixel-wise, along with the precision in the image plane of those predictions. All such gathered perspective information is then fused (without linearization) into a single, globally valid 13 × 13-sized information matrix, which is then regressed to yield the six-dimensional result. This separation allows the CNN to operate solely in image space, whereas the conversion from 2D image space to 6D pose is solved analytically. Additionally, the intermediate result of the globally valid information matrix facilitates the fusion with auxiliary information, such as depth, stereo, and prior knowledge, with ease, as it is simply a 13 × 13 matrix addition. With this approach, the pose is regressed from a fusion of all available data, unlike the more ad hoc approach of combining estimates in postprocessing. Also, the CNN call is wholly unaffected by the addition of these supplemental data. An extensive evaluation of the proposed architecture on multiple benchmark datasets showcases meaningful uncertainty estimates while maintaining competitive pose performance. Also, it shows that adding auxiliary information can significantly improve pose performance, but always relative to the amount of new information gained while maintaining the quality of the estimated uncertainty.
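
The idea that fusing sources reduces to adding information matrices can be illustrated with a small sketch. This is not the thesis's 13 × 13 formulation, whose parameterization the abstract does not specify; it is a simplified two-dimensional linear-Gaussian toy example, and all function names and numeric values below are hypothetical. Each source is converted to information form (inverse covariance and information vector), the sources are summed, and the fused estimate and its covariance are recovered by solving a linear system.

```python
import numpy as np

def to_information(mean, cov):
    """Convert a Gaussian (mean, covariance) to information form."""
    info_matrix = np.linalg.inv(cov)
    info_vector = info_matrix @ mean
    return info_matrix, info_vector

def fuse_information(sources):
    """Fuse independent sources by summing their information terms."""
    dim = sources[0][0].shape[0]
    fused_matrix = np.zeros((dim, dim))
    fused_vector = np.zeros(dim)
    for info_matrix, info_vector in sources:
        fused_matrix += info_matrix
        fused_vector += info_vector
    return fused_matrix, fused_vector

def solve_estimate(info_matrix, info_vector):
    """Recover the fused mean and covariance from information form."""
    mean = np.linalg.solve(info_matrix, info_vector)
    cov = np.linalg.inv(info_matrix)
    return mean, cov

# Hypothetical values: an image-like source that is precise in the first
# coordinate, and a depth-like source that is precise in the second.
image_src = to_information(np.array([1.0, 2.0]), np.diag([0.04, 1.0]))
depth_src = to_information(np.array([1.2, 2.5]), np.diag([1.0, 0.01]))

fused_matrix, fused_vector = fuse_information([image_src, depth_src])
mean, cov = solve_estimate(fused_matrix, fused_vector)
```

In this toy setting the fused variance in each coordinate is smaller than that of either source alone, and the gain is proportional to the information each source contributes, which mirrors the abstract's remark that improvement is relative to the amount of new information gained.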
