Vision-language models (VLMs) offer transformative potential for robotics, but their deployment is constrained by performance limitations. In safety-critical manipulation, a model must recognize its own limitations to prevent a catastrophic failure. We conduct a systematic study of VLMs for robotic failure detection, evaluating six architectures on real-world trajectories. We put forward a decision-making process that allows a VLM to evaluate whether it can successfully complete a task, and if not, pause its operation and hand over the task to human operators. Our results show that well-calibrated VLMs can be trustworthy partners that know exactly when to ask for help.
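The handover step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the VLM exposes a calibrated success probability for the task, and the names `Decision`, `decide`, and the default `threshold` of 0.8 are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str                 # "proceed" or "handover" (hypothetical labels)
    success_probability: float  # the model's self-estimated chance of success


def decide(success_probability: float, threshold: float = 0.8) -> Decision:
    """Proceed if the estimated success probability meets the threshold;
    otherwise pause and hand the task to a human operator.

    The threshold value is an assumption, not a figure from the paper.
    """
    if not 0.0 <= success_probability <= 1.0:
        raise ValueError("success_probability must lie in [0, 1]")
    action = "proceed" if success_probability >= threshold else "handover"
    return Decision(action, success_probability)


if __name__ == "__main__":
    for p in (0.95, 0.40):
        print(p, decide(p).action)
```

A rule like this is only as reliable as the probability it receives, which is why calibration matters: if the model's stated confidence does not track its actual success rate, no threshold separates safe attempts from unsafe ones.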
Md Sameer Iqbal Chowdhury
Tsz-Chiu Au
Chowdhury et al. studied this question.
www.synapsesocial.com/papers/69e1cfe05cdc762e9d858eec — DOI: https://doi.org/10.1109/mpuls.2026.3659245