Visuomotor policies easily overfit to characteristics of their training data, such as fixed camera positions and backgrounds. As a result, a policy performs well in in-distribution scenarios but generalizes poorly to out-of-distribution ones. Existing methods also struggle to fuse multi-view information into an effective 3D representation. To address these issues, we propose Omni-Vision Diffusion Policy (OmniD), a multi-view fusion framework that synthesizes image observations into a unified bird's-eye view (BEV) representation. We introduce a deformable attention-based Omni-Feature Generator (OFG) that selectively abstracts task-relevant features while suppressing view-specific noise and background distractions. OmniD achieves average improvements of 11%, 17%, and 84% over the best baseline model in in-distribution, out-of-distribution, and few-shot experiments, respectively. Training code and the simulation benchmark are available at https://github.com/1mather/omnid.git
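To make the idea of fusing several camera views into one BEV cell more concrete, the following is a minimal, illustrative sketch of deformable-attention-style sampling in plain Python. It is not the OFG implementation from the released code. The function names (`bilinear_sample`, `softmax`, `fuse_bev_cell`) are hypothetical, and the reference points, sampling offsets, and attention logits are passed in as fixed inputs, whereas a trained model would presumably predict the offsets and logits from a learned BEV query.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly sample a 2D grid of feature vectors at continuous (x, y).

    feat[row][col] is a list of floats (one value per channel).
    Coordinates outside the grid are clamped to its border.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(len(feat[0][0])):
        top = feat[y0][x0][c] * (1 - fx) + feat[y0][x1][c] * fx
        bottom = feat[y1][x0][c] * (1 - fx) + feat[y1][x1][c] * fx
        out.append(top * (1 - fy) + bottom * fy)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_bev_cell(view_feats, ref_points, offsets, logits):
    """Fuse one BEV cell from several camera views (illustrative only).

    view_feats: list of per-view feature maps, feat[row][col][channel].
    ref_points: per-view (x, y) location the BEV cell is assumed to project to.
    offsets:    per-view list of (dx, dy) sampling offsets (assumed given here).
    logits:     per-view list of unnormalized attention scores, one per offset.
    """
    samples, flat_logits = [], []
    for feat, (rx, ry), offs, lgts in zip(view_feats, ref_points, offsets, logits):
        for (dx, dy), lg in zip(offs, lgts):
            samples.append(bilinear_sample(feat, rx + dx, ry + dy))
            flat_logits.append(lg)
    # Normalize jointly over all views and sampling points, so that
    # low-scoring (e.g. background) samples contribute little to the result.
    weights = softmax(flat_logits)
    channels = len(samples[0])
    return [sum(w * s[c] for w, s in zip(weights, samples)) for c in range(channels)]


# Toy usage: two 2x2 single-channel views, one sampling point per view.
view_a = [[[1.0], [1.0]], [[1.0], [1.0]]]
view_b = [[[3.0], [3.0]], [[3.0], [3.0]]]
fused = fuse_bev_cell(
    [view_a, view_b],
    ref_points=[(0.5, 0.5), (0.5, 0.5)],
    offsets=[[(0.0, 0.0)], [(0.0, 0.0)]],
    logits=[[0.0], [0.0]],
)
```

In this sketch the attention weights are normalized jointly across views, which is one simple way for a fused cell to favor informative views over distracting ones. The actual normalization and sampling scheme used by OmniD may differ.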
Jilei Mao
Jiao Guan
Yingjuan Tang
Mao et al. studied this question.
www.synapsesocial.com/papers/68d6e16f8b2b6861e4c40216 — DOI: https://doi.org/10.48550/arxiv.2508.11898