Recent deep learning-based video watermarking methods perform well when no compression is applied but struggle under real H.264/AVC compression, for two reasons: (1) they rely on differentiable simulators that model real encoding behavior inaccurately, especially quantization and motion estimation, which creates a simulation-to-reality gap and weakens robustness at high compression levels; and (2) they do not explicitly account for the frequency-selective attenuation imposed by the H.264 quantization matrix, which prevents targeted protection of robust frequency bands. To address these issues, we propose CSTFMark (Curriculum-guided Spatio-Temporal Frequency Modulation for Robust Video Watermarking), a framework designed for real H.264/AVC compression. CSTFMark uses a 3D U-Net backbone and a Three-Stage Curriculum Learning strategy (TPC-H264) that trains progressively on undistorted videos, then on simulated compression, and finally on real, non-differentiable H.264 streams, with training stabilized by momentum-based gradient continuation. It includes a Spatio-temporal Dynamic Frequency Modulation module (ST-DFM) that adaptively enhances compression-resilient frequencies, and a Hyper-prior Guided Embedding mechanism (HGE) for semantics-aware watermark modulation. Experiments on Kinetics-400 and UCF-101 show that CSTFMark significantly outperforms state-of-the-art methods and maintains high extraction accuracy even at high CRF (constant rate factor) values. A multi-scale spatiotemporal perception loss helps watermarked videos retain high visual fidelity and temporal coherence, and the framework supports HD resolution with 32-frame inputs. CSTFMark thus offers a practical, scalable solution for robust video watermarking in real encoding pipelines.
Zhu et al. studied this question.
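The three-stage curriculum described in the abstract can be pictured as a schedule that maps each training epoch to one of three distortion regimes. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `CurriculumSchedule`, the stage labels, and the epoch boundaries (10 and 30) are hypothetical and chosen only for illustration. Momentum-based gradient continuation is not modeled here, since the abstract does not specify how it works.

```python
from dataclasses import dataclass

# Stage labels mirror the order given in the abstract; the strings themselves
# are illustrative names, not identifiers from the paper.
STAGES = ("undistorted", "simulated_compression", "real_h264")


@dataclass(frozen=True)
class CurriculumSchedule:
    """Hypothetical epoch-based schedule for a three-stage curriculum."""

    simulated_start: int  # assumed first epoch that uses simulated compression
    real_start: int       # assumed first epoch that uses real H.264 encoding

    def __post_init__(self) -> None:
        # The stages must appear in order: undistorted, simulated, real.
        if not 0 <= self.simulated_start <= self.real_start:
            raise ValueError("expected 0 <= simulated_start <= real_start")

    def stage(self, epoch: int) -> str:
        """Return the stage label that applies at the given epoch."""
        if epoch < 0:
            raise ValueError("epoch must be non-negative")
        if epoch < self.simulated_start:
            return STAGES[0]
        if epoch < self.real_start:
            return STAGES[1]
        return STAGES[2]


# Example with assumed boundaries: simulated compression from epoch 10,
# real H.264 streams from epoch 30.
schedule = CurriculumSchedule(simulated_start=10, real_start=30)
plan = [schedule.stage(epoch) for epoch in (0, 9, 10, 29, 30, 100)]
```

In a training loop, the label returned by `schedule.stage(epoch)` would select which distortion is applied between the embedder and the extractor. Only the final stage involves a non-differentiable encoder, which is where a gradient workaround of the kind the abstract mentions would be needed.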