What question did this study set out to answer?

This work aims to improve action recognition models' ability to generalize to novel action categories and unseen video domains.

June 19, 2026

XOV-Action: Towards Generalizable Open-Vocabulary Action Recognition

Key Points

This work aims to improve action recognition models' ability to generalize to novel action categories and unseen video domains.
Developed a model named XOV-Action focused on generalizable open-vocabulary action recognition.
Proposed learning diversified elaboration representations for understanding novel actions.
Learned scene-agnostic video representations to mitigate scene bias.
XOV-Action shows improved action recognition performance for closed-set categories across various video domains.
For open-set categories, XOV-Action enhances generalization compared to state-of-the-art models.

Abstract

Inspired by the impressive success of image-text foundation models, recent works have proposed to adapt these foundation models to video data, leading to efficient and effective video models for open-vocabulary action recognition. However, through a comprehensive evaluation, our work finds that state-of-the-art open-vocabulary action recognition models still struggle with generalization to video domains that they have not encountered. To address this limitation, we introduce generalizable open-vocabulary action recognition, which aims to develop action recognition models capable of generalizing to both novel action categories and unseen video domains. Our work contributes a novel model named XOV-Action to overcome two critical challenges: (1) understanding novel action concepts of open-set categories, and (2) mitigating the scenario discrepancy between training and test datasets. Specifically, XOV-Action first proposes to capture diverse action-related concepts by learning diversified elaboration representations, which enables better generalization to open-set action categories. Second, XOV-Action learns scene-agnostic video representations to overcome the scene bias, which improves the generalization in unseen video domains. Additionally, to evaluate models in generalizable open-vocabulary action recognition, we contribute a new cross-domain action benchmark named XOVABench, which covers multiple video domains with varying degrees of gaps and consists of both closed-set and open-set action categories. Extensive quantitative and qualitative experiments demonstrate that our proposed XOV-Action can effectively improve the action recognition performance for both closed-set and open-set categories across video domains.
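As a rough illustration of the open-vocabulary setting the abstract describes, the sketch below scores a video embedding against several text-description ("elaboration") embeddings per action class and picks the class with the highest mean cosine similarity. This is a generic matching scheme in the style of image-text foundation models, not the XOV-Action implementation. The function names, the averaging rule, and the toy three-dimensional embeddings are all assumptions made for illustration; in practice the embeddings would come from a pretrained video encoder and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify_open_vocabulary(video_embedding, class_elaborations):
    """Hypothetical helper: pick the class whose elaboration embeddings
    are, on average, most similar to the video embedding.

    class_elaborations maps each class label to a list of text embeddings,
    one per description of that action (an assumption of this sketch).
    """
    scores = {}
    for label, elaboration_embeddings in class_elaborations.items():
        sims = [cosine(video_embedding, e) for e in elaboration_embeddings]
        scores[label] = sum(sims) / len(sims)
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (made up); real ones would come from pretrained encoders.
toy_classes = {
    "surfing": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "juggling": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
toy_video = [0.8, 0.2, 0.0]

predicted_label, label_scores = classify_open_vocabulary(toy_video, toy_classes)
print(predicted_label)
```

Because classes are represented only by text embeddings, a novel category can be added at test time by inserting a new entry into the dictionary, with no retraining of the classifier in this sketch.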


Cite This Study

Lin et al. (Thu,) studied this question.

synapsesocial.com/papers/6a34dc0f65a5b0777af2c799 https://doi.org/10.1109/tpami.2026.3704589
