Inspired by the impressive success of image-text foundation models, recent works have proposed adapting these foundation models to video data, leading to efficient and effective video models for open-vocabulary action recognition. However, through a comprehensive evaluation, our work finds that state-of-the-art open-vocabulary action recognition models still struggle to generalize to video domains that they have not encountered. To address this limitation, we introduce generalizable open-vocabulary action recognition, which aims to develop action recognition models capable of generalizing to both novel action categories and unseen video domains. Our work contributes a novel model named XOV-Action to overcome two critical challenges: (1) understanding novel action concepts of open-set categories, and (2) mitigating the scenario discrepancy between training and test datasets. Specifically, XOV-Action first captures diverse action-related concepts by learning diversified elaboration representations, which enables better generalization to open-set action categories. Second, XOV-Action learns scene-agnostic video representations to overcome scene bias, which improves generalization to unseen video domains. Additionally, to evaluate models in generalizable open-vocabulary action recognition, we contribute a new cross-domain action benchmark named XOVABench, which covers multiple video domains with varying degrees of domain gap and consists of both closed-set and open-set action categories. Extensive quantitative and qualitative experiments demonstrate that our proposed XOV-Action effectively improves action recognition performance for both closed-set and open-set categories across video domains.
Lin et al. (Thu,) studied this question.