Unmanned surface vessels (USVs) equipped with onboard vision are increasingly used in environmental monitoring, search and rescue, and autonomous navigation. However, conventional USV autonomy systems often adopt a decoupled design in which target perception and disturbance estimation are developed independently. Such systems can degrade when water-surface reflections, illumination variations, or partial occlusions make visual observations unreliable, and their disturbance observers still rely on manually tuned parameters even though environmental disturbances vary over time. To address these issues, this paper proposes a three-stage framework that co-optimizes target perception and disturbance estimation for USVs. First, a lightweight hybrid convolutional neural network (CNN)–Transformer perception module is developed to extract robust vessel features under challenging water-surface visual conditions. Second, a reinforcement learning (RL)-driven mechanism adaptively tunes a higher-order sliding mode observer (HOSMO) for disturbance estimation. Third, a confidence-guided perception–observer co-optimization strategy is formulated, in which visual confidence regulates observer adaptation and reduces estimation divergence during temporary perception degradation. Simulation and outdoor lake experiments show that the proposed framework improves visual matching accuracy, observer convergence, and estimation stability compared with conventional decoupled methods. The outdoor lake experiments provide initial real-world validation under natural illumination variations and mild water-surface disturbances; further open-water and multi-vessel validation is planned for future work.
Shi et al. (Sat,) studied this question.