Abstract

Modern manufacturing increasingly relies on robotics to achieve high throughput and quality, especially as production lines become more flexible and parts more customized. Robotic inspection is a critical enabler of quality assurance because it supports repeatable measurements while reducing human workload and variability. Recent vision-language-action (VLA) models have advanced robotic manipulation by integrating visual perception and language understanding for autonomous control. However, their application to robotic inspection, which requires accurate movement without altering the environment, remains underexplored. This work investigates the feasibility of adapting manipulation-pretrained VLA models to an inspection-oriented feature-following task and makes two contributions. First, two approaches for assessing VLA performance, tailored to the requirements of inspection problems, are introduced: a trajectory-based evaluation metric that quantifies performance in rollouts, and an action-level metric that is useful during fine-tuning. Second, an open-source, manipulation-pretrained VLA model is fine-tuned for a feature-following task. This task represents a simplified 2D inspection setting designed to capture core aspects of inspection problems encountered in domains such as manufacturing and infrastructure. In a real-world robotic setup, the fine-tuned model successfully executes complex feature-following trajectories with performance competitive with that of a human operator, demonstrating effective transfer from manipulation to this class of inspection tasks. Although the study is limited in scale, the results provide initial evidence that VLA models can be extended beyond manipulation to support feature perception and motion generation in automated robotic inspection. This suggests their potential to support more consistent and automated inspection processes and motivates further investigation into robustness and generalization.
Krüger et al. (Fri,) studied this question.