What does this research mean for the field?

The proposed transformer-based model significantly improves action recognition and localization performance in video understanding tasks by effectively capturing both intra- and inter-video contextual information.

What question did this study set out to answer?

The aim is to devise a framework that enhances action spotting by leveraging both intra- and inter-video context.

March 12, 2026 | Open Access

Intra- and inter-video context-aware action spotting

Key Points

The aim is to devise a framework that enhances action spotting by leveraging both intra- and inter-video context.
Proposes a transformer-based model that treats action spotting as a set prediction problem.
Captures long-range intra-video temporal information to discover causal relationships between actions.
Introduces an action memory module that stores a compact feature representation of each action class during training to improve recognition and localization.
Outperforms existing state-of-the-art methods on public benchmarks by a large margin.
Improves both action recognition accuracy and localization performance.
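
The action memory idea in the key points can be illustrated with a small sketch. This page does not describe how the memory is written or read, so everything below is an assumption made for illustration: the class name `ActionMemory`, the exponential-moving-average update, the `momentum` parameter, and the cosine-similarity lookup are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class ActionMemory:
    """Hypothetical memory: one compact running-average feature per action class."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # assumed hyperparameter, not from the paper
        self.slots = {}           # class label -> stored feature vector

    def update(self, label, feature):
        """Blend a feature seen during training into the slot for `label`."""
        old = self.slots.get(label)
        if old is None:
            # First feature seen for this class initializes the slot.
            self.slots[label] = list(feature)
        else:
            m = self.momentum
            self.slots[label] = [m * o + (1 - m) * f for o, f in zip(old, feature)]

    def nearest(self, feature):
        """Return the class whose stored vector is most similar, or None if empty."""
        if not self.slots:
            return None
        return max(self.slots, key=lambda c: cosine(self.slots[c], feature))
```

Because each slot is updated with features from many training videos, a lookup against the memory brings in context from other videos of the same action class, which is the inter-video signal the key points refer to.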

Abstract

As a recently proposed task in video understanding, action spotting aims to locate and classify actions in a long video clip, and it can be applied to automatically generate video summaries and highlights. To address this problem, this article proposes a novel framework capable of capturing both intra- and inter-video contextual information. In particular, we present a transformer-based model that treats the task as a set prediction problem, which aims to match the set of predicted action instances with the set of ground truths. The model is able to capture long-range intra-video temporal information and discover causal relationships between actions. Next, based on the observation that actions of the same type recur in different videos, we propose to exploit the inter-video contextual information available in the dataset. To do so, we design an action memory module that stores a compact feature representation of each action class during training, so as to improve action recognition and localization performance. We evaluate our model on public benchmarks and demonstrate that it outperforms state-of-the-art methods by a large margin.
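
To make the set prediction formulation more concrete, the following is a minimal sketch of one-to-one matching between predicted action instances and ground truths. The abstract does not give the matching cost, so the cost used here (negative class probability plus a weighted temporal distance), the `time_weight` parameter, and the function names are assumptions for illustration only. The brute-force search over permutations is a stand-in that finds the optimal assignment for small sets; set prediction models typically use the Hungarian algorithm for this step.

```python
from itertools import permutations


def match_cost(pred, gt, time_weight=1.0):
    """Assumed cost of assigning one prediction to one ground truth.

    pred: (class_probs, time) with class_probs a dict label -> probability
    gt:   (label, time); times are assumed normalized to [0, 1]
    """
    probs, t_pred = pred
    label, t_gt = gt
    return -probs.get(label, 0.0) + time_weight * abs(t_pred - t_gt)


def best_matching(preds, gts, time_weight=1.0):
    """Return (pairs, cost) for the cheapest one-to-one assignment.

    pairs is a list of (pred_index, gt_index). Assumes len(preds) >= len(gts);
    unmatched predictions would be treated as "no action".
    """
    best_cost, best_pairs = float("inf"), []
    # Try every way of picking a distinct prediction for each ground truth.
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(
            match_cost(preds[p], gts[g], time_weight) for g, p in enumerate(perm)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs, best_cost
```

With three predictions and two ground truths, for example, each ground truth is paired with the prediction that agrees with it best in class and time, and the remaining prediction is left unmatched. A training loss would then be computed over the matched pairs.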
