What question did this study set out to answer?

The aim is to improve traffic scene perception under challenging conditions using a multimodal framework.

February 26, 2026Open Access

TrafficPerceiver: A Multimodal Large Language Model with Reinforcement Learning for Unified Challenging Traffic Scene Perception

Key Points

The aim is to improve traffic scene perception under challenging conditions using a multimodal framework.
Developed a multimodal large language model (MLLM) for traffic scene perception.
Introduced reinforcement learning with group-relative policy optimization (GRPO) to enhance performance.
Constructed the challenging traffic scene understanding (CTSU) dataset with dense annotations for experiments.
Conducted experiments on DRAMA-ROLISP and CTSU datasets to evaluate model performance.
Achieved state-of-the-art performance in understanding and segmentation tasks.
Demonstrated improved fine-grained perception under adverse conditions like rain and fog.
Showed enhanced ability to follow human instructions in traffic scenarios.

Abstract

Understanding traffic scenes under diverse and challenging conditions is critical for intelligent transportation systems (ITS). Existing methods primarily focus on ideal scenarios and often lack the ability to perform fine-grained perception or respond to human instructions. To address these limitations, we propose TrafficPerceiver, a unified multimodal framework based on a multimodal large language model (MLLM) that jointly supports both image understanding and target-oriented segmentation. To enhance the model’s performance under adverse conditions such as rain, fog, and motion blur, we introduce a reinforcement learning optimization strategy based on group-relative policy optimization (GRPO), which encourages interpretable, instruction-following behavior. Additionally, we construct the challenging traffic scene understand ing (CTSU) dataset, a large-scale dataset tailored to challenging traffic environments, with dense annotations for both segmentation and instruction-response tasks. Extensive experiments on both the DRAMA-ROLISP and CTSU datasets demonstrate that TrafficPerceiver achieves state-of-the-art performance in both understanding and segmentation tasks.

Bookmark

View Full Paper

Bookmark

View Full Paper

TrafficPerceiver: A Multimodal Large Language Model with Reinforcement Learning for Unified Challenging Traffic Scene Perception

Key Points

Abstract

Cite This Study