What question did this study set out to answer?

The aim is to improve pedestrian crossing intention prediction by enhancing existing models' generalization and robustness.

February 26, 2026 · Open Access

LVLMPed-CoT: A Large Vision-Language Model with Chain-of-Thought Mechanism for Pedestrian Crossing Intention Prediction

Key points

Developed LVLMPed-CoT, a large vision-language model with a chain-of-thought mechanism.
Incorporated multimodal data as input for prediction.
Utilized data distillation and a two-stage fine-tuning strategy.
Trained on joint open-source datasets (JAAD and PIE).
Conducted an ablation study to evaluate prompt design and fine-tuning impact.
Achieved performance superior or comparable to state-of-the-art models on public datasets.
Validated the importance of the CoT prompt design in model performance.
Analyzed the effects of input data sequence length and image quality on accuracy and inference time.

Abstract

Pedestrian crossing intention prediction is crucial for autonomous driving. While existing models have achieved high accuracy, their generalization and robustness remain limited, which hinders their performance in real-world scenarios. To overcome these limitations, we introduce LVLMPed-CoT, a large vision-language model (LVLM) that incorporates a chain-of-thought (CoT) mechanism to enhance pedestrian crossing intention prediction. It takes multimodal data as input and employs data distillation along with a two-stage fine-tuning strategy to elicit the implicit CoT capability of a lightweight vision-language model for enhanced perception, reasoning, and prediction. The unified LVLMPed-CoT is trained on a joint open-source dataset (JAAD and PIE) and achieves performance superior or comparable to state-of-the-art models on both large-scale public datasets. The ablation study validates the contribution of the CoT prompt design and the two-stage fine-tuning strategy to the model's performance. Further analysis investigates the impact of input data sequence length and image quality on both accuracy and inference time, as well as the interpretability of the enhanced CoT reasoning ability achieved through fine-tuning.
