• VLM-driven active obstacle clearing framework for robotic harvesting.
• Target-centric minimal scene graph for structured VLM inputs.
• Jointly predicted action types and continuous execution parameters.
• Field tests demonstrated an 83.9% clearing success rate.
• Outperformed rule-based, geometry-based (Geo+Exec), and image-only VLM baselines.

Severe occlusions frequently limit robotic fruit harvesting, while active obstacle separation offers a practical route to reach otherwise inaccessible targets. This work presents a VLM-centered perception–understanding–execution framework for active obstacle separation in strawberry harvesting. Synchronized RGB–D sensing produced a target-centric minimal scene graph that encoded spatial and occlusion relations into a compact, execution-oriented interface for a vision–language model (VLM). The VLM analyzed this structured scene description and jointly mapped it to a normalized clearing record (action, direction, distance, force), deciding not only whether and what to clear (push, drag, push-and-drag, or direct pick), but also how to clear via continuous parameters. An executability-aware decoder standardized units, enforced minimal-clearance and safety constraints, and synthesized task-space trajectories, turning generic VLM outputs into feasible motions. In simulation, the framework achieved 89.4% clearing success on a mixed benchmark and up to 92.7% on single-action push scenes; on the same mixed benchmark, clearing success improved from 60.0% with a rule-based scene-graph policy and 78.0% with an image-only VLM baseline to 89.2%, while reducing mean path length to 0.55 m and unsafe actions to 4.8%. In real-robot field trials on the HarvestFlex platform, the same pipeline achieved 83.9% clearing success and 82.8% end-to-end pick success under natural plant growth and contact uncertainties, indicating that most cleared scenes translated into successful harvesting.
Under identical prompts and thresholds, the standardized scene-graph interface transferred across different VLM backends, with GPT-5 yielding the strongest overall performance among the evaluated models.
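To make the clearing record and the executability-aware decoding step more concrete, the following is a minimal Python sketch, not the authors' implementation. The class `ClearingRecord`, the function `decode`, the action labels, the unit table, and every numeric limit (`MAX_DISTANCE_M`, `MAX_FORCE_N`, `MIN_CLEARANCE_M`) are hypothetical names and values chosen for illustration. The straight-line waypoint interpolation is likewise an assumed stand-in for the trajectory synthesis described in the abstract.

```python
from dataclasses import dataclass
import math

# Illustrative limits only; these values are assumptions, not from the paper.
MAX_DISTANCE_M = 0.10    # longest clearing stroke allowed
MIN_CLEARANCE_M = 0.01   # shortest stroke considered useful
MAX_FORCE_N = 2.0        # contact force ceiling
VALID_ACTIONS = {"push", "drag", "push_and_drag", "direct_pick"}
UNIT_TO_M = {"m": 1.0, "cm": 0.01, "mm": 0.001}


@dataclass
class ClearingRecord:
    """Hypothetical container for the (action, direction, distance, force) record."""
    action: str
    direction: tuple      # (x, y, z), not necessarily unit length
    distance: float       # expressed in `unit`
    force: float          # newtons
    unit: str = "m"


def decode(record, start, n_waypoints=5):
    """Standardize units, clamp to the assumed limits, and return
    (safe_record, waypoints) with waypoints in task space."""
    if record.action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {record.action}")
    if record.unit not in UNIT_TO_M:
        raise ValueError(f"unknown unit: {record.unit}")

    # A direct pick needs no clearing motion.
    if record.action == "direct_pick":
        return ClearingRecord("direct_pick", (0.0, 0.0, 0.0), 0.0, 0.0), [tuple(start)]

    # Normalize the direction to unit length.
    norm = math.sqrt(sum(c * c for c in record.direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    direction = tuple(c / norm for c in record.direction)

    # Convert to meters, then enforce the clearance and safety bounds.
    distance = record.distance * UNIT_TO_M[record.unit]
    distance = min(max(distance, MIN_CLEARANCE_M), MAX_DISTANCE_M)
    force = min(max(record.force, 0.0), MAX_FORCE_N)

    # Straight-line waypoints from the start point along the direction.
    waypoints = [
        tuple(s + d * distance * k / n_waypoints for s, d in zip(start, direction))
        for k in range(n_waypoints + 1)
    ]
    return ClearingRecord(record.action, direction, distance, force, "m"), waypoints


if __name__ == "__main__":
    raw = ClearingRecord("push", (2.0, 0.0, 0.0), 25.0, 5.0, unit="cm")
    safe, path = decode(raw, start=(0.0, 0.0, 0.0))
    print(safe)
    print(path[-1])
```

In this sketch an over-long 25 cm, 5 N push request is reduced to the assumed 0.10 m and 2.0 N ceilings before any motion is generated, which mirrors the idea of turning generic VLM outputs into feasible motions.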
Zhao et al. (Sun,) studied this question.