Although underwater gliders are highly energy-efficient platforms capable of long-duration and large-scale ocean observation, their lack of self-propulsion requires external assistance for recovery upon mission completion. In harsh and dynamic marine environments, reliably detecting the glider and accurately estimating its three-dimensional position are critical to ensuring the recovery operations are safe and efficient. This paper proposes a perception framework based on deep learning to detect underwater glider hulls and estimate their three-dimensional relative positions using camera–sonar multi-sensor fusion. This approach integrates a hierarchical convolutional neural network (CNN) vision encoder and a transformer-based architecture to estimate the glider’s spatial location and heading direction simultaneously. The hierarchical CNN encoder extracts multi-level, semantically rich visual features, thereby improving robustness to visual degradation and environmental disturbances common in underwater settings. Additionally, the transformer incorporates a variable mixture-of-experts (vMoE) mechanism that adaptively allocates expert networks across layers, enhancing representational capacity while maintaining computational efficiency. The resulting pose estimates enable precise, collision-free ROV navigation for automated recovery and onboard sensor inspection tasks. Experimental results, including ablation studies, validate the effectiveness of the proposed components and demonstrate their contributions to accurate glider hull detection and three-dimensional localization. Overall, the proposed framework provides a scalable, reliable perception solution that allows for the safe, autonomous recovery of underwater gliders with an ROV in realistic ocean environments.
Lee et al. (Thu,) studied this question.