Recent advances in vision–language models (VLMs) have transformed robotics. Researchers are combining the reasoning capabilities of large language models (LLMs) with the visual processing capabilities of VLMs across many domains. However, most efforts have focused on terrestrial robots and transfer poorly to volatile environments such as the ocean surface and underwater, where real-time judgment is required. We propose a system that integrates the cognition, decision making, path planning, and control of autonomous marine surface vehicles in the ROS2–Gazebo simulation environment, using a multimodal vision–LLM system with zero-shot prompting for real-time adaptability. Experiments were conducted to verify the effectiveness of the proposed system and to evaluate its performance with and without a path-planning step. In 30 experiments, adding the path-planning mode increased the success rate from 23% to 73%, while the average distance increased from 39 m to 45 m and the time required to complete the task increased from 483 s to 672 s. The final algorithm with the path-planning sub-process therefore yields a higher success rate at the cost of a longer average path and completion time, a trade-off between improved reliability and reduced efficiency. We achieve real-time environmental adaptability and improved performance through prompt engineering and the added path-planning sub-process, within a limited structure in which the LLM state is initialized with every application programming interface (API) call (zero-shot prompting). Additionally, the developed system is independent of the vision–LLM archetype, making it scalable and adaptable to future models.
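To put the reported trade-off in relative terms, the short sketch below recomputes the changes from the figures quoted in the abstract. The helper name `relative_change` and the variable names are illustrative assumptions, not part of the authors' system; only the six input numbers come from the abstract.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Figures quoted in the abstract, as (without, with) the path-planning mode.
success_rate_pct = (23.0, 73.0)  # success rate, percent
distance_m = (39.0, 45.0)        # average distance, metres
time_s = (483.0, 672.0)          # task completion time, seconds

# Success rates are already percentages, so report the gain in percentage points.
success_gain_pp = success_rate_pct[1] - success_rate_pct[0]
distance_change = relative_change(*distance_m)
time_change = relative_change(*time_s)

print(f"success rate: +{success_gain_pp:.0f} percentage points")
print(f"distance:     {distance_change:+.1f}%")
print(f"time:         {time_change:+.1f}%")
```

Under these figures, the success rate rises by 50 percentage points while the distance grows by roughly 15% and the completion time by roughly 39%.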
Taeyeon Kim
Yonsei University
Woen-Sug Choi
Korea Maritime and Ocean University
Journal of Marine Science and Engineering
Kim et al. studied this question.
synapsesocial.com/papers/68a3669b0a429f797332c31b — DOI: https://doi.org/10.3390/jmse13081553