What question did this study set out to answer?

The study evaluates how effectively vision-language models (VLMs) detect failures during robotic manipulation tasks.

April 17, 2026

Assessing Vision-Language Models for Failure Detection in Robotic Manipulation

Key Points

The study systematically compares six vision-language model architectures.
The models are evaluated on real-world robotic manipulation trajectories.
The authors propose a decision-making process in which a model judges whether it can complete a task and, if not, hands the task over to a human operator.
Well-calibrated vision-language models can recognize the limits of their own performance.
Such models can pause a task and hand it over to human operators when they cannot proceed.

Abstract

Vision-language models (VLMs) offer transformative potential for robotics, but their deployment is constrained by performance limitations. In safety-critical manipulation, a model must recognize its own limitations to prevent a catastrophic failure. We conduct a systematic study of VLMs for robotic failure detection, evaluating six architectures on real-world trajectories. We put forward a decision-making process that allows a VLM to evaluate whether it can successfully complete a task, and if not, pause its operation and hand over the task to human operators. Our results show that well-calibrated VLMs can be trustworthy partners that know exactly when to ask for help.
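The abstract does not spell out the decision-making process, so the following is only a minimal sketch of one way a pause-and-handover gate could look, not the authors' method. It assumes the VLM exposes a calibrated estimate of task success; the names `Assessment`, `decide`, and `Action`, and the threshold value of 0.8, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    HANDOVER = "pause_and_handover"


@dataclass
class Assessment:
    # Hypothetical: the VLM's own estimate of the probability that it can
    # complete the task. The source does not say how this is obtained.
    success_probability: float


def decide(assessment: Assessment, threshold: float = 0.8) -> Action:
    """Continue only if the estimated success probability clears the threshold.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    p = assessment.success_probability
    if not 0.0 <= p <= 1.0:
        raise ValueError("success_probability must be in [0, 1]")
    if p >= threshold:
        return Action.CONTINUE
    # Below the threshold: pause and hand the task over to a human operator.
    return Action.HANDOVER
```

A gate like this is only as useful as the probability estimate behind it: if the model is poorly calibrated, a fixed threshold will either hand over too often or miss failures, which is why calibration matters for this kind of handover.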
