With recent advances in maritime technology and the growing use of USVs (Unmanned Surface Vehicles), precise and safe autonomous berthing has become increasingly important for realizing fully autonomous navigation. LiDAR (Light Detection And Ranging), the sensor commonly used on USVs, is expensive, and electronic charts cannot fully reflect the real-time dynamic environment. To overcome these limitations, this study proposes a method for generating a 3D environmental map for autonomous berthing of USVs using a low-cost stereo camera that operates robustly in maritime environments. The core novelty of this study lies in the design of a ‘maritime-specific unified filtering pipeline’ rather than a simple combination of existing deep learning models. First, a 2.5D depth map and corresponding masks were generated using vision foundation models to extract robust features in textureless maritime environments. Next, the proposed multi-stage postprocessing pipeline was applied, integrating geometric constraints based on camera parameters, SOR (Statistical Outlier Removal), and voxel-grid downsampling. By eliminating physically impossible data and ghost obstacles, this pipeline converts the unstable raw outputs of deep learning models into highly reliable 3D point clouds that can be treated as deterministic sensor data. To verify the effectiveness of the proposed method, virtual environments precisely simulating the real world were constructed. In the comparison, FoundationStereo outperformed models such as CREStereo (Cascaded REcurrent Stereo) and Depth Anything V2, restoring fine structures, such as ship masts, and complex distant skylines without distortion.
Specifically, the proposed method achieved MAEs (Mean Absolute Errors) of 0.12 m, 0.11 m, and 0.08 m for 3D point positions across three scenarios, a reduction of up to 94.62% relative to the compared methods. Furthermore, the generated 3D environmental map was converted into a 2D BEV (Bird’s-Eye-View) map, confirming that USVs can intuitively perceive the horizontal positional relationships of obstacles and use them for route planning.
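The postprocessing stages named above (geometric constraints, SOR, voxel-grid downsampling, and BEV projection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the camera-frame convention (x right, y down, z forward), and every threshold below are assumptions chosen for illustration, and the brute-force neighbor search is only practical for small point clouds.

```python
import numpy as np

def geometric_filter(points, z_min=0.5, z_max=60.0, y_min=-15.0, y_max=2.0):
    """Drop physically implausible points (assumed bounds, camera frame in metres)."""
    y, z = points[:, 1], points[:, 2]
    finite = np.isfinite(points).all(axis=1)
    keep = finite & (z > z_min) & (z < z_max) & (y > y_min) & (y < y_max)
    return points[keep]

def statistical_outlier_removal(points, k=8, std_ratio=2.0):
    """Remove points whose mean distance to their k nearest neighbours is
    larger than (global mean + std_ratio * global std). O(N^2) memory."""
    if len(points) <= k:
        return points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # After sorting each row, column 0 is the zero self-distance; skip it.
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    threshold = knn_mean.mean() + std_ratio * knn_mean.std()
    return points[knn_mean <= threshold]

def voxel_downsample(points, voxel=0.5):
    """Replace all points that fall in the same voxel by their centroid."""
    idx = np.floor(points / voxel).astype(np.int64)
    _, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against shape differences across NumPy versions
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    return sums / counts[:, None]

def to_bev(points, x_range=(-20.0, 20.0), z_range=(0.0, 60.0), cell=0.5):
    """Project points onto the horizontal x-z plane as a boolean occupancy grid."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    nz = int(round((z_range[1] - z_range[0]) / cell))
    col = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)
    row = np.floor((points[:, 2] - z_range[0]) / cell).astype(int)
    ok = (col >= 0) & (col < nx) & (row >= 0) & (row < nz)
    grid = np.zeros((nz, nx), dtype=bool)
    grid[row[ok], col[ok]] = True
    return grid

def build_bev(points):
    """Chain the four stages in the order described in the text."""
    cleaned = statistical_outlier_removal(geometric_filter(points))
    return to_bev(voxel_downsample(cleaned))
```

In this sketch, `build_bev` takes an `(N, 3)` array of back-projected stereo points and returns the occupancy grid. For realistic cloud sizes, a k-d tree or a point cloud library such as Open3D, whose `PointCloud` class provides `remove_statistical_outlier` and `voxel_down_sample`, would likely be a better fit than the brute-force distance matrix used here.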
Yeo et al. (Sun,) studied this question.