What question did this study set out to answer?

To establish a deep learning framework for detecting underwater glider hulls and estimating their three-dimensional positions.

March 7, 2026Open Access

A Study of the Three-Dimensional Localization of an Underwater Glider Hull Using a Hierarchical Convolutional Neural Network Vision Encoder and a Variable Mixture-of-Experts Transformer

Key Points

To establish a deep learning framework for detecting underwater glider hulls and estimating their three-dimensional positions.
Utilized a hierarchical convolutional neural network for feature extraction.
Implemented a transformer-based architecture with a variable mixture-of-experts mechanism.
Conducted experiments including ablation studies to validate the framework.
Achieved accurate detection of glider hulls in dynamic marine environments.
Estimated three-dimensional positions of gliders effectively.
Demonstrated scalability and reliability of the perception framework for autonomous recovery.

Abstract

Although underwater gliders are highly energy-efficient platforms capable of long-duration and large-scale ocean observation, their lack of self-propulsion requires external assistance for recovery upon mission completion. In harsh and dynamic marine environments, reliably detecting the glider and accurately estimating its three-dimensional position are critical to ensuring the recovery operations are safe and efficient. This paper proposes a perception framework based on deep learning to detect underwater glider hulls and estimate their three-dimensional relative positions using camera–sonar multi-sensor fusion. This approach integrates a hierarchical convolutional neural network (CNN) vision encoder and a transformer-based architecture to estimate the glider’s spatial location and heading direction simultaneously. The hierarchical CNN encoder extracts multi-level, semantically rich visual features, thereby improving robustness to visual degradation and environmental disturbances common in underwater settings. Additionally, the transformer incorporates a variable mixture-of-experts (vMoE) mechanism that adaptively allocates expert networks across layers, enhancing representational capacity while maintaining computational efficiency. The resulting pose estimates enable precise, collision-free ROV navigation for automated recovery and onboard sensor inspection tasks. Experimental results, including ablation studies, validate the effectiveness of the proposed components and demonstrate their contributions to accurate glider hull detection and three-dimensional localization. Overall, the proposed framework provides a scalable, reliable perception solution that allows for the safe, autonomous recovery of underwater gliders with an ROV in realistic ocean environments.

A Study of the Three-Dimensional Localization of an Underwater Glider Hull Using a Hierarchical Convolutional Neural Network Vision Encoder and a Variable Mixture-of-Experts Transformer

Key Points

Abstract

Cite This Study