ABSTRACT Autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) continue to advance, yet effective social coordination with human road users (HRUs) remains a key challenge. This study introduces PGR‐Net, a spatiotemporal deep learning (DL) model for pedestrian gesture recognition (PGR) that aims to bridge the gap in AV‐pedestrian communication. We created the PGR‐Net v1.0 dataset by remapping Jester gesture labels to AV‐relevant classes: Stop, Go, and Greeting/Thanking. In addition, a No Gesture class is defined via a sequential hand‐presence rule. PGR‐Net fuses the spatiotemporal stream of R(2+1)D, a three‐dimensional convolutional neural network (3D‐CNN) architecture, with hand‐pose landmarks, followed by recurrent neural network (RNN) encoders and self‐attention layers that emphasise gesture‐relevant frames. On the PGR‐Net v1.0 dataset, PGR‐Netv2 achieves 88.29% accuracy, an absolute improvement of 12.56 percentage points over the baseline R(2+1)D model. Qualitative tests on single images beyond the dataset indicate sensible generalisation and highlight the importance of short spatiotemporal context for PGR. These results suggest that hand‐augmented spatiotemporal modelling is a viable path toward robust, AV‐relevant PGR across various traffic scenarios. We discuss current limitations arising from the scarcity of PGR‐specific datasets and outline directions for broader in‐the‐wild data and context‐aware modelling to improve applicability.
Mahdi et al. studied this question.
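The label remapping and the sequential hand‐presence rule mentioned in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which Jester labels feed each class, nor how the hand‐presence rule is defined, so the mapping `JESTER_TO_PGR`, the threshold `min_run`, and the functions `has_sustained_hand` and `assign_label` are all hypothetical. The per‐frame hand flags are assumed to come from some external hand detector.

```python
from typing import Optional, Sequence

# Hypothetical remapping of a few Jester labels to the AV-relevant classes.
# The actual mapping used for PGR-Net v1.0 is not given in the abstract.
JESTER_TO_PGR = {
    "Stop Sign": "Stop",
    "Pulling Hand In": "Go",
    "Thumb Up": "Greeting/Thanking",
    "Shaking Hand": "Greeting/Thanking",
}


def has_sustained_hand(hand_present: Sequence[bool], min_run: int = 3) -> bool:
    """Return True if a hand is detected in at least `min_run` consecutive frames.

    This is one assumed reading of a "sequential hand-presence rule".
    """
    run = 0
    for present in hand_present:
        run = run + 1 if present else 0  # reset the streak on a missed frame
        if run >= min_run:
            return True
    return False


def assign_label(
    jester_label: str, hand_present: Sequence[bool], min_run: int = 3
) -> Optional[str]:
    """Assign a clip to a class, or return None if the clip is left unmapped."""
    if not has_sustained_hand(hand_present, min_run):
        return "No Gesture"
    return JESTER_TO_PGR.get(jester_label)
```

For example, under these assumptions a "Stop Sign" clip with a hand visible in every frame maps to Stop, while the same clip with a hand appearing only in isolated frames falls into No Gesture.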