Tunnels, soundproof screens, and other vertical roadside traffic facilities play an important role in isolating the driving environment, maintaining driving safety, and reducing driving noise. Over time in service, these facade structures accumulate dirt, which can cause traffic safety problems. For intelligent cleaning, the most challenging environment-perception problem is handling obstacles of different shapes, colors, and sizes on these three-dimensional walls. This paper proposes an obstacle segmentation method based on a vision–language model to address this problem. First, a vision–language obstacle dataset, named the Road-side General Obstacles Dataset (RGOD), is collected in a constructed experimental environment, and each sample is labeled with both a segmentation mask and a language description. These labeled data are used as the training input of the perception model to obtain foreground–background separation results. Second, a VLM-GOS model is proposed to segment special-shaped obstacles, emphasizing the distinction between background and foreground targets. Finally, general obstacles are segmented by a vision–language model with a similar loss function, and the results are evaluated with different metrics. Experimental results show that, compared with models such as MaskFormer, SegFormer, and ASD-Net, this method improves perceptual ability and increases accuracy by 3%. More importantly, the model is more interpretable.
Guo et al. (Mon,) studied this question.