How do humans recognize an action or an interaction in the real world? Because viewing perspectives vary widely, even humans find it challenging to identify a familiar activity when they observe it from an uncommon perspective. We argue that discriminative spatiotemporal information remains an essential cue for human action recognition. Most existing skeleton-based methods learn representations according to hand-crafted criteria, which require large amounts of labeled data and considerable human effort. This article introduces adaptive skeleton-based neural networks that learn an optimal spatiotemporal representation automatically in a data-driven manner. First, an adaptive skeleton representation transformation (ASRT) method is proposed to model view-variation data without hand-crafted criteria. Next, a novel attentional LSTM that encapsulates 3-D convolution (C3D-LSTM) enables the memory blocks to learn both short-term frame dependencies and long-term relations, so the proposed model can understand long-term or complex actions more accurately. Furthermore, a data-enhancement-driven end-to-end training scheme is proposed to train key parameters with fewer training samples. Building on the learned spatiotemporal representation, the proposed model achieves state-of-the-art performance on five challenging benchmarks.
Yu et al. studied this question.