What question did this study set out to answer?

The aim is to develop an automated system for accurately detecting and determining the pose of intermeshed steel components using RGB images.

June 20, 2026 · Open Access

ConPose: a jointly trained, single-pass RGB detection-and-pose framework for intermeshed steel connections

Key Points

ConPose is a CNN-based detection-and-pose framework that processes RGB input in a single forward pass.
The pose head uses a fixed (parameter-free) edge bank and class-conditioned feature-wise linear modulation (FiLM) for robustness.
Performance is evaluated on a six-class intermeshed steel connections dataset annotated with bounding boxes and precise 6-DoF poses.
ConPose achieves mAP of 89.3% at IoU 0.50 and 71.5% at IoU 0.75.
Pose accuracy is 73.4% at ADD(S)@0.1d and 88.1% at ADD(S)@0.2d, where d is the object diameter.
Median rotation error is 2.0° and median translation error is 3.8 cm; pose accuracy is higher than that of re-trained RGB baselines.
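
For readers unfamiliar with the pose metrics quoted above, the following is a minimal sketch of how ADD, ADD-S, and the "@k·d" threshold are commonly computed. It follows the usual convention (mean model-point distance, with a closest-point variant for symmetric objects, compared against a fraction of the model diameter) and is not taken from the ConPose code. The function names, the toy cube model, and the use of metres are assumptions made for illustration.

```python
import math

def transform(R, t, p):
    """Apply rotation R (3x3 nested lists) and translation t to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under both poses."""
    total = 0.0
    for p in points:
        total += math.dist(transform(R_gt, t_gt, p), transform(R_pred, t_pred, p))
    return total / len(points)

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean closest-point distance, typically used for symmetric objects."""
    pred = [transform(R_pred, t_pred, p) for p in points]
    total = 0.0
    for p in points:
        g = transform(R_gt, t_gt, p)
        total += min(math.dist(g, q) for q in pred)
    return total / len(points)

def model_diameter(points):
    """Largest pairwise distance between model points (brute force)."""
    return max(math.dist(a, b) for a in points for b in points)

def pose_is_correct(error, diameter, k=0.1):
    """A pose counts as correct at ADD(S)@k*d if the error is below k * diameter."""
    return error < k * diameter

# Hypothetical toy model: corners of a 0.2 m cube centred at the origin.
cube = [(x, y, z) for x in (-0.1, 0.1) for y in (-0.1, 0.1) for z in (-0.1, 0.1)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # a symmetry of the cube

err_add = add_metric(cube, identity, (0, 0, 0), rot_z_90, (0, 0, 0))
err_adds = add_s_metric(cube, identity, (0, 0, 0), rot_z_90, (0, 0, 0))
```

In this toy case the 90° rotation maps the cube onto itself, so ADD penalises the prediction heavily while ADD-S does not. That difference is why the symmetric variant is usually selected per object class, which is what the "(S)" in ADD(S) denotes.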

Abstract

Reliable automated assembly of intermeshed steel components (ISC) needs perception that can identify parts, localise parts, and recover their full 6-DoF pose from RGB alone. We present ConPose, a jointly trained, single-pass RGB detection-and-pose framework for ISC. A CNN-based detector proposes boxes; for each box, we crop pixels with the exact crop affine and feed a lightweight pose head that directly regresses rotation and translation. The key idea is that all geometry terms are written in the same region-of-interest (ROI) coordinate frame as the pose head, using the camera intrinsics and the crop affine, so that supervision is aligned with the coordinates actually consumed during prediction. To improve robustness under visually challenging ISC conditions, the head augments the ROI with a fixed (parameter-free) edge bank and applies class-conditioned feature-wise linear modulation (FiLM) on a small set of learned queries. Our approach uses no PnP, no iterative refinement, and no depth. We evaluate on a six-class ISC dataset with bounding boxes and precise 6-DoF poses. ConPose delivers strong accuracy in a single forward pass: mAP₅₀ = 89.3%, mAP₇₅ = 71.5%, ADD(S)@0.1d = 73.4%, ADD(S)@0.2d = 88.1%, with median rotation 2.0° and median translation 3.8 cm. Compared to re-trained RGB baselines, ConPose yields higher pose accuracy while maintaining competitive detection. Ablations show that ROI-frame supervision, the fixed edge bank, and class-conditioned queries each provide clear gains.
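
One way to picture the ROI-frame idea is to compose the crop affine with the camera intrinsics, so that 3D points project directly into the cropped-and-resized patch that the pose head sees. The sketch below illustrates that composition under stated assumptions: an axis-aligned box crop resized to a fixed output size, a pinhole camera without distortion, and no half-pixel offset convention. The abstract does not give the paper's exact formulation, so the function names and numbers here are hypothetical.

```python
def crop_affine(box, out_size):
    """3x3 affine mapping full-image pixels to ROI pixels for an axis-aligned crop."""
    x0, y0, x1, y1 = box
    out_w, out_h = out_size
    sx = out_w / (x1 - x0)  # horizontal resize factor
    sy = out_h / (y1 - y0)  # vertical resize factor
    return [[sx, 0.0, -sx * x0],
            [0.0, sy, -sy * y0],
            [0.0, 0.0, 1.0]]

def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def project(K, X):
    """Pinhole projection of a camera-frame 3D point X with intrinsics K."""
    h = [sum(K[i][j] * X[j] for j in range(3)) for i in range(3)]
    return (h[0] / h[2], h[1] / h[2])

# Hypothetical camera and detection box, for illustration only.
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
box = (600.0, 300.0, 800.0, 400.0)  # x0, y0, x1, y1 in full-image pixels
A = crop_affine(box, (64, 64))      # ROI resized to 64 x 64
K_roi = matmul3(A, K)               # intrinsics expressed in the ROI frame

X = (0.1, -0.05, 2.0)               # 3D point in the camera frame (metres)
uv_full = project(K, X)             # pixel in the full image
uv_roi = project(K_roi, X)          # pixel in the ROI
```

Projecting with `K_roi` gives the same ROI pixel as projecting with `K` and then applying the crop affine, so geometric targets can be expressed in the coordinates of the crop rather than the full image.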
