Abstract Because video annotation for fully-supervised temporal action detection (TAD) demands substantial time and labor, extensive research has been devoted to weakly-supervised TAD. However, existing weakly-supervised TAD approaches still suffer from severe localization errors due to the absence of fine-grained frame-level annotations. To tackle this issue, single-frame supervised TAD has recently been proposed. This paper does not introduce a new approach. Instead, it presents an empirical study of the factors that influence single-frame supervised TAD, which have not yet been studied and thus remain unclear. We go back to basics and investigate the effects of five fundamental components on the performance of single-frame supervised TAD: 1) feature extraction, 2) feature modeling, 3) temporal embedding, 4) classification head, and 5) video-level classification loss. In this investigation, we explore the potential of traditional technical solutions for single-frame supervised TAD and reveal benefits of these solutions that have not previously been reported to the research community. Based on the findings, we build a baseline detector that achieves state-of-the-art performance. To compensate for the limitations of mAP (mean average precision), we evaluate performance with both mAP and VCCR (video-level classification correctness rate); VCCR serves as a supplementary metric supporting mAP. We hope that our work can facilitate future research in this field.
Jo et al. studied this question.