What question did this study set out to answer?

The aim is to optimize the placement and dynamic reconfiguration of underwater sonar networks using deep reinforcement learning to enhance maritime monitoring.

April 17, 2026 · Open Access

Environment-Aware Optimal Placement and Dynamic Reconfiguration of Underwater Robotic Sonar Networks Using Deep Reinforcement Learning

Key Points

Formulated sonar placement as a finite-horizon Markov decision process.
Trained a Proximal Policy Optimization agent for optimal sensor layout.
Estimated flow-aware motion costs using a PPO trajectory policy with a Long Short-Term Memory (LSTM) network.
Implemented a zero-element assignment method for optimal AUV reassignment.
Conducted simulation studies to validate performance against traditional methods.
Achieved a final reward 16–21% higher than Particle Swarm Optimization (PSO) and 2–3.7% higher than the Genetic Algorithm (GA).
Reduced average travel time by 30.44% compared with the A* algorithm.
Supported scalable fleet operations and adaptive detection, classification, localization, and tracking (DCLT) under dynamic conditions.
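The constructive placement formulation in the key points can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' environment: the class name `PlacementEnv`, the `comm_range` parameter, the coverage radius, and the reward terms are all hypothetical. A plain connected/not-connected check stands in for algebraic connectivity, a coverage fraction stands in for the BELLHOP-based detection term, and a uniform-random policy stands in for the trained PPO agent.

```python
import random
from itertools import product


class PlacementEnv:
    """Constructive finite-horizon placement MDP on a grid (illustrative only)."""

    def __init__(self, rows, cols, n_sensors, comm_range=2.0):
        self.cells = list(product(range(rows), range(cols)))
        self.n_sensors = n_sensors    # horizon: one sensor is placed per step
        self.comm_range = comm_range  # hypothetical link range, in cell units
        self.placed = []

    def reset(self):
        self.placed = []
        return tuple(self.placed)

    def valid_actions(self):
        # Action mask: each cell hosts at most one sensor (collision-free layout).
        taken = set(self.placed)
        return [cell for cell in self.cells if cell not in taken]

    def step(self, cell):
        if cell in self.placed:
            raise ValueError("cell already occupied")
        self.placed.append(cell)
        done = len(self.placed) == self.n_sensors
        # Reward is given only at the end of the episode (terminal reward).
        reward = self.terminal_reward() if done else 0.0
        return tuple(self.placed), reward, done

    def is_connected(self):
        # Stand-in for algebraic connectivity: graph search over in-range links.
        seen, stack = {self.placed[0]}, [self.placed[0]]
        while stack:
            r, c = stack.pop()
            for other in self.placed:
                dist = ((other[0] - r) ** 2 + (other[1] - c) ** 2) ** 0.5
                if other not in seen and dist <= self.comm_range:
                    seen.add(other)
                    stack.append(other)
        return len(seen) == len(self.placed)

    def terminal_reward(self):
        # Placeholder detection term: fraction of cells within 1.5 cells of a sensor.
        covered = sum(
            any((r - pr) ** 2 + (c - pc) ** 2 <= 1.5 ** 2 for pr, pc in self.placed)
            for r, c in self.cells
        )
        coverage = covered / len(self.cells)
        return coverage + (1.0 if self.is_connected() else 0.0)


def rollout(env, rng):
    """Run one episode with a uniform-random policy in place of a trained agent."""
    env.reset()
    done, reward = False, 0.0
    while not done:
        action = rng.choice(env.valid_actions())
        _, reward, done = env.step(action)
    return list(env.placed), reward


if __name__ == "__main__":
    layout, reward = rollout(PlacementEnv(4, 4, 3), random.Random(0))
    print(layout, reward)
```

The episode length equals the number of sensors, which is what makes the horizon finite, and the action mask is one simple way to keep the layout collision-free by construction.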

Abstract

Underwater dynamic target detection, classification, localization, and tracking (DCLT) is central to maritime surveillance and monitoring, and it increasingly relies on distributed robotic sonar networks based on autonomous underwater vehicles (AUVs) that operate in passive listening and, when required, cooperative multistatic modes. Achieving robust performance in realistic ocean conditions remains challenging for two reasons: sensor placement must adapt to time-varying acoustic conditions and target priors while preserving acoustic communication connectivity, and frequent reconfiguration under dynamic currents makes classical large-scale planning computationally expensive. This paper presents an integrated deep reinforcement learning (DRL)-based framework for passive-stage sonar placement and dynamic reconfiguration in distributed AUV networks. First, we cast placement as a constructive finite-horizon Markov decision process (MDP) and train a Proximal Policy Optimization (PPO) agent to sequentially build a collision-free layout on a discretized surveillance grid. The terminal reward is formulated to jointly optimize the environment-aware detection performance, computed from BELLHOP-based transmission loss models, and global network connectivity, quantified using algebraic connectivity. Second, to enable time-critical reconfiguration, we estimate flow-aware motion costs for all AUV–destination pairs using a PPO trajectory policy with a Long Short-Term Memory (LSTM) network, trained for partial observability. The learned policy can be deployed onboard, allowing each AUV to refine its path online using locally sensed currents, which improves robustness to ocean-model uncertainty. The resulting cost matrix is solved via an efficient zero-element assignment method to obtain the optimal one-to-one reassignment.
In the reported simulation studies, the proposed Sequential PPO placement method achieves a final reward 16–21% higher than Particle Swarm Optimization (PSO) and 2–3.7% higher than the Genetic Algorithm (GA), while the proposed PPO + LSTM planner reduces average travel time by 30.44% compared with A*. The proposed closed-loop architecture supports frequent re-optimization, scalable fleet operation, and a seamless transition to communication-supported cooperative multistatic tracking after detection, enabling efficient, adaptive DCLT in dynamic marine environments.
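The reassignment step described in the abstract is a one-to-one assignment problem over a cost matrix. A minimal sketch follows, with several assumptions: the function name `best_assignment` and the cost values are hypothetical, and brute-force enumeration of permutations stands in for the paper's zero-element assignment solver, whose details are not given here. Enumeration is only practical for a handful of AUVs, but it returns the same kind of optimal one-to-one matching.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (assignment, total) minimizing the summed cost of a square matrix.

    cost[i][j] is the estimated travel cost for AUV i to reach destination j.
    assignment[i] is the destination index given to AUV i.
    """
    n = len(cost)
    if any(len(row) != n for row in cost):
        raise ValueError("cost matrix must be square")
    best_perm, best_total = None, float("inf")
    # Try every one-to-one mapping of AUVs to destinations and keep the cheapest.
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


if __name__ == "__main__":
    # Hypothetical travel-cost estimates for 3 AUVs (rows) and 3 destinations (columns).
    cost = [
        [4, 1, 3],
        [2, 0, 5],
        [3, 2, 2],
    ]
    assignment, total = best_assignment(cost)
    print(assignment, total)  # prints [1, 0, 2] 5
```

In this example AUV 1 is not sent to destination 1 even though that pairing costs 0, because the matching is chosen for the lowest total over the whole fleet rather than per vehicle.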
