What question did this study set out to answer?

The aim is to develop a VLM-driven framework for effective obstacle clearing in robotic harvesting tasks.

March 26, 2026 · Open Access

Active Obstacle Separation: Vision-Language Model (VLM) Driven Clearing Decisions for Robotic Harvesting

Key Points

The aim is to develop a VLM-driven framework for effective obstacle clearing in robotic harvesting tasks.
Developed a target-centric minimal scene graph for VLM inputs.
Utilized synchronized RGB-D sensing to encode spatial and occlusion relations.
Established a framework to predict actions and parameters for obstacle clearing.
Conducted simulations and real-field tests on the HarvestFlex platform.
Achieved 83.9% clearing success in field trials.
Increased simulation clearing success to 89.4%, with up to 92.7% for single-action push scenes.
Improved clearing success from 60.0% (rule-based) and 78.0% (image-only VLM) to 89.2% using the proposed framework.
Reduced mean path length to 0.55 m and unsafe actions to 4.8%.
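One way to picture the target-centric minimal scene graph is the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Node` and `Edge` types, the relation label `occludes`, the object names, and the text serialization are all hypothetical. The sketch keeps only the target and the objects directly related to it, then flattens the result into a compact description that could be passed to a VLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str                        # hypothetical object identifier
    kind: str                        # hypothetical label: "target", "fruit", "leaf", ...
    xyz: Tuple[float, float, float]  # assumed camera-frame position in metres (from RGB-D)


@dataclass
class Edge:
    src: str
    dst: str
    relation: str                    # hypothetical relation label, e.g. "occludes"


def minimal_scene_graph(nodes: List[Node], edges: List[Edge], target: str):
    """Keep the target plus every object that shares an edge with it."""
    kept_edges = [e for e in edges if target in (e.src, e.dst)]
    names = {target}
    for e in kept_edges:
        names.update((e.src, e.dst))
    kept_nodes = [n for n in nodes if n.name in names]
    return kept_nodes, kept_edges


def to_text(nodes: List[Node], edges: List[Edge], target: str) -> str:
    """Serialize the pruned graph into a short, structured text description."""
    lines = [f"target: {target}"]
    for n in nodes:
        x, y, z = n.xyz
        lines.append(f"node {n.name} ({n.kind}) at x={x:.2f} y={y:.2f} z={z:.2f} m")
    for e in edges:
        lines.append(f"{e.src} {e.relation} {e.dst}")
    return "\n".join(lines)
```

Pruning to the target's direct neighbors is only one possible reading of "minimal"; the summary does not specify which nodes and relations the graph retains.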

Abstract

• VLM-driven active obstacle clearing framework for robotic harvesting.
• Target-centric minimal scene graph for structured VLM inputs.
• Jointly predicted action types and continuous execution parameters.
• Field tests demonstrated an 83.9% clearing success rate.
• Outperformed rule-based, geometry-based (Geo+Exec), and image-only VLM baselines.

Severe occlusions frequently limit robotic fruit harvesting, while active obstacle separation offers a practical route to reach otherwise inaccessible targets. This work presents a VLM-centered perception–understanding–execution framework for active obstacle separation in strawberry harvesting. Synchronized RGB-D sensing produced a target-centric minimal scene graph that encoded spatial and occlusion relations into a compact, execution-oriented interface for a vision-language model (VLM). The VLM analyzed this structured scene description and jointly mapped it to a normalized clearing record (action, direction, distance, force), deciding not only whether and what to clear (push, drag, push-and-drag, or direct pick), but also how to clear via continuous parameters. An executability-aware decoder standardized units, enforced minimal-clearance and safety constraints, and synthesized task-space trajectories, turning generic VLM outputs into feasible motions. In simulation, the framework achieved 89.4% clearing success on a mixed benchmark and up to 92.7% on single-action push scenes; on the same mixed benchmark, clearing success improved from 60.0% with a rule-based scene-graph policy and 78.0% with an image-only VLM baseline to 89.2%, while reducing mean path length to 0.55 m and unsafe actions to 4.8%. In real-robot field trials on the HarvestFlex platform, the same pipeline achieved 83.9% clearing success and 82.8% end-to-end pick success under natural plant growth and contact uncertainties, indicating that most cleared scenes translated into successful harvesting.
Under identical prompts and thresholds, the standardized scene-graph interface transferred across different VLM backends, with GPT-5 yielding the strongest overall performance among the evaluated models.
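The clearing record and the executability-aware decoder described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the published method: the field names, the unit table, the numeric limits (`MAX_DISTANCE_M`, `MIN_CLEARANCE_M`, `MAX_FORCE_N`), and the straight-line trajectory are all assumptions chosen for the example. The sketch validates the action type, normalizes the direction vector, converts the distance to metres, clamps distance and force to assumed bounds, and interpolates task-space waypoints.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

ACTIONS = {"push", "drag", "push-and-drag", "direct pick"}  # action types named in the abstract
UNIT_TO_M: Dict[str, float] = {"m": 1.0, "cm": 0.01, "mm": 0.001}

# Hypothetical limits; the real constraints are not given in the summary.
MIN_CLEARANCE_M = 0.02
MAX_DISTANCE_M = 0.10
MAX_FORCE_N = 5.0


@dataclass
class ClearingRecord:
    action: str
    direction: Tuple[float, float, float]  # unit vector in task space
    distance_m: float
    force_n: float


def decode(raw: dict) -> ClearingRecord:
    """Turn a raw VLM output dict into a unit-consistent, bounded record."""
    action = raw["action"].strip().lower()
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action == "direct pick":
        # No clearing motion is needed before picking.
        return ClearingRecord(action, (0.0, 0.0, 0.0), 0.0, 0.0)

    dx, dy, dz = raw["direction"]
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    direction = (dx / norm, dy / norm, dz / norm)

    # Standardize units, then clamp to the assumed clearance and safety bounds.
    distance = raw["distance"] * UNIT_TO_M[raw.get("distance_unit", "m")]
    distance = min(max(distance, MIN_CLEARANCE_M), MAX_DISTANCE_M)
    force = min(max(raw["force"], 0.0), MAX_FORCE_N)
    return ClearingRecord(action, direction, distance, force)


def waypoints(start: Tuple[float, float, float], rec: ClearingRecord,
              steps: int = 5) -> List[Tuple[float, ...]]:
    """Straight-line task-space waypoints from start along the clearing direction."""
    return [
        tuple(s + d * rec.distance_m * i / steps for s, d in zip(start, rec.direction))
        for i in range(steps + 1)
    ]
```

For instance, a raw output requesting a 25 cm push at 9 N would be clamped here to 0.10 m and 5 N before any trajectory is generated, which mirrors the idea that the decoder, not the VLM, has the final say on executability.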
