Understanding traffic scenes under diverse and challenging conditions is critical for intelligent transportation systems (ITS). Existing methods focus primarily on ideal scenarios and often cannot perform fine-grained perception or respond to human instructions. To address these limitations, we propose TrafficPerceiver, a unified framework built on a multimodal large language model (MLLM) that jointly supports image understanding and target-oriented segmentation. To improve performance under adverse conditions such as rain, fog, and motion blur, we introduce a reinforcement learning optimization strategy based on group-relative policy optimization (GRPO), which encourages interpretable, instruction-following behavior. Additionally, we construct the challenging traffic scene understanding (CTSU) dataset, a large-scale dataset tailored to challenging traffic environments, with dense annotations for both segmentation and instruction-response tasks. Extensive experiments on the DRAMA-ROLISP and CTSU datasets demonstrate that TrafficPerceiver achieves state-of-the-art performance on both understanding and segmentation tasks.
Kuang et al. (Sun,) studied this question.