Video coding plays a critical role in efficient transmission for surveillance camera sensors. Although long-term reference (LTR) has been extensively studied in traditional hand-designed video coding approaches, its potential in learned video coding remains unexplored because of two problems: the highly unequal importance of long and short motion, and the excessive motion overhead, especially for dense motion representations such as optical flow. In this paper, we build an LTR baseline for learned surveillance video coding and propose an adaptive long–short modeling approach to address these problems. Specifically, we first introduce LTR and a long–short context mining module into the authorized end-to-end video coding exploration model (EEM) from China’s AVS to form a baseline. Since the quality of the LTR significantly impacts its performance and importance, we then enhance the LTR. Next, we propose a long–short motion adapter to address the unequal importance. Finally, a historical motion guidance module is introduced to aid motion decoding. Experimental results demonstrate that the proposed approach turns the 1.86% BD-rate loss of EEM-4.1 into a 13.89% BD-rate saving in YUV-PSNR, both measured against the H.266/VVC anchor under a low-delay P configuration. Although the current results do not match the 44.01% gains of DCVC-FM, the proposed approach consumes fewer computational resources, and we believe that integrating the proposed LTR method with stronger baselines will further boost performance.
Wu et al. (Thu,) studied this question.