Action spotting, a recently proposed video understanding task, aims to locate and classify actions in a long video clip; it can be applied to automatically generating video summaries and highlights. To address this problem, this article proposes a novel framework that captures both intra- and inter-video contextual information. In particular, we present a transformer-based model that treats the task as a set prediction problem, matching the set of predicted action instances to the set of ground-truth instances. The model captures long-range intra-video temporal information and discovers causal relationships between actions. Next, based on the observation that actions of the same type recur across different videos, we propose to exploit inter-video contextual information from the dataset. To do so, we design an action memory module that stores a compact feature representation of each action class during training, so as to improve action recognition and localization performance. We evaluate our model on a public benchmark and demonstrate that it outperforms state-of-the-art methods by a large margin.
Chen et al. (Mon,) studied this question.
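To make the set prediction formulation concrete, the following is a minimal sketch of matching a set of predicted action instances to a set of ground-truth instances. It is not the model described above: the `(time, label)` representation, the cost function `pair_cost`, the `class_penalty` value, and the example labels are all assumptions introduced for illustration, and the brute-force search over permutations stands in for the Hungarian algorithm that set prediction methods typically use.

```python
from itertools import permutations

# Hypothetical representation: an action instance is (time_in_seconds, class_label).


def pair_cost(pred, truth, class_penalty=10.0):
    """Cost of pairing one prediction with one ground-truth action.

    Assumed cost: temporal distance plus a fixed penalty for a wrong class.
    """
    pred_time, pred_label = pred
    true_time, true_label = truth
    temporal_cost = abs(pred_time - true_time)
    label_cost = 0.0 if pred_label == true_label else class_penalty
    return temporal_cost + label_cost


def match_sets(preds, truths):
    """Find the one-to-one pairing with the lowest total cost.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    prediction paired with truths[i]. Unmatched predictions are left over.
    Brute force, so only suitable for small sets.
    """
    if len(preds) < len(truths):
        raise ValueError("this sketch assumes at least as many predictions as ground truths")
    best_assignment, best_cost = (), float("inf")
    # Try every ordered choice of len(truths) distinct predictions.
    for candidate in permutations(range(len(preds)), len(truths)):
        cost = sum(pair_cost(preds[p], truths[t]) for t, p in enumerate(candidate))
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


if __name__ == "__main__":
    # Hypothetical soccer-style labels, used only as example data.
    preds = [(12.0, "goal"), (47.5, "foul"), (90.0, "goal")]
    truths = [(46.0, "foul"), (11.0, "goal")]
    print(match_sets(preds, truths))
```

In this example the ground-truth foul is paired with the second prediction and the ground-truth goal with the first, leaving the third prediction unmatched. In a training setting, the matched pairs would supply the targets for the classification and localization losses, while unmatched predictions would be treated as background.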