Abstract. Reliable automated assembly of intermeshed steel components (ISC) requires perception that can identify and localise parts and recover their full 6-DoF pose from RGB alone. We present ConPose, a jointly trained, single-pass RGB detection-and-pose framework for ISC. A CNN-based detector proposes boxes; for each box, we crop the image with the exact crop affine and feed the crop to a lightweight pose head that directly regresses rotation and translation. The key idea is that all geometry terms are expressed in the same region-of-interest (ROI) coordinate frame that the pose head operates in, using the camera intrinsics and the crop affine, so that supervision is aligned with the coordinates actually consumed during prediction. To improve robustness under visually challenging ISC conditions, the head augments the ROI with a fixed (parameter-free) edge bank and applies class-conditioned feature-wise linear modulation (FiLM) to a small set of learned queries. Our approach uses no PnP, no iterative refinement, and no depth. We evaluate on a six-class ISC dataset annotated with bounding boxes and precise 6-DoF poses. ConPose delivers strong accuracy in a single forward pass: mAP₅₀ = 89.3%, mAP₇₅ = 71.5%, ADD(S)@0.1d = 73.4%, ADD(S)@0.2d = 88.1%, with a median rotation error of 2.0° and a median translation error of 3.8 cm. Compared to re-trained RGB baselines, ConPose yields higher pose accuracy while maintaining competitive detection performance. Ablations show that ROI-frame supervision, the fixed edge bank, and class-conditioned queries each provide clear gains.
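The ROI-frame idea can be illustrated with a minimal sketch. It assumes an axis-aligned crop-and-resize and a standard pinhole camera; the abstract does not give ConPose's exact formulation, so the helper names (`crop_affine`, `matmul3`, `project`, `apply_affine`) and the numeric values are hypothetical. The sketch shows that composing the crop affine with the intrinsics gives an ROI-frame intrinsics matrix, so a 3D point can be projected directly into the coordinates of the crop.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# ROI-frame intrinsics obtained by composing a crop affine with K.

def crop_affine(box, roi_size):
    """3x3 homogeneous affine mapping image pixels to ROI pixels.

    box is (x0, y0, x1, y1) in image pixels. The box is resized to
    roi_size x roi_size with an independent scale per axis.
    """
    x0, y0, x1, y1 = box
    sx = roi_size / (x1 - x0)
    sy = roi_size / (y1 - y0)
    return [[sx, 0.0, -sx * x0],
            [0.0, sy, -sy * y0],
            [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project(K, point):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    h = [sum(K[i][k] * point[k] for k in range(3)) for i in range(3)]
    return h[0] / h[2], h[1] / h[2]


def apply_affine(A, uv):
    """Apply a 3x3 homogeneous affine to a 2D pixel coordinate."""
    u, v = uv
    return (A[0][0] * u + A[0][1] * v + A[0][2],
            A[1][0] * u + A[1][1] * v + A[1][2])


# Hypothetical camera intrinsics and detector box, for illustration only.
K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]
box = (200.0, 120.0, 328.0, 248.0)
A = crop_affine(box, roi_size=64)

# ROI-frame intrinsics: the crop affine applied after the camera matrix.
K_roi = matmul3(A, K)

point = (-0.2, -0.2, 2.0)  # 3D point in the camera frame, in metres

# Route 1: project into the full image, then map into the crop.
uv_via_image = apply_affine(A, project(K, point))
# Route 2: project directly with the ROI-frame intrinsics.
uv_direct = project(K_roi, point)

print(uv_via_image, uv_direct)
```

Both routes give the same ROI coordinates because the last row of the affine is (0, 0, 1), which leaves the perspective division unchanged. Under this assumption, geometry losses can be computed against `K_roi` in the same frame as the pixels the pose head sees.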
Adebayo et al. (Thu,) studied this question.