March 3, 2026 | Open Access

CSTFMark: robust video watermarking against H.264/AVC compression via dynamic spatio-temporal frequency modulation and curriculum learning

Key Points

CSTFMark maintains high watermark extraction accuracy even at high compression rates in experiments on Kinetics-400 and UCF-101.
CSTFMark employs a 3D U-Net backbone and a Three-Stage Curriculum Learning strategy to enhance its robustness against H.264 encoding.
The Spatio-temporal Dynamic Frequency Modulation (ST-DFM) module adaptively reinforces compression-resilient frequencies, while a multi-scale spatiotemporal perception loss preserves visual fidelity.
CSTFMark offers a scalable solution for video watermarking, addressing limitations of previous methods under real compression conditions.

Abstract

Recent deep learning-based video watermarking methods excel in compression-free settings but struggle under real H.264/AVC compression because of two key issues: (1) reliance on differentiable simulators that inaccurately model real encoding behavior, especially quantization and motion estimation, which causes a simulation-to-reality gap and poor robustness at high compression; and (2) failure to explicitly account for the frequency-selective attenuation imposed by the H.264 quantization matrix, which hinders targeted protection of robust frequency bands. To bridge this gap, we propose CSTFMark (Curriculum-guided Spatio-Temporal Frequency Modulation for Robust Video Watermarking), a framework designed for real H.264/AVC compression. CSTFMark employs a 3D U-Net backbone and a Three-Stage Curriculum Learning strategy (TPC-H264) that progressively trains on undistorted videos, then simulated compression, and finally real non-differentiable H.264 streams, with training stabilized by momentum-based gradient continuation. It features a Spatio-temporal Dynamic Frequency Modulation module (ST-DFM) to adaptively enhance compression-resilient frequencies, and a Hyper-prior Guided Embedding mechanism (HGE) for semantics-aware watermark modulation. Experiments on Kinetics-400 and UCF-101 show that CSTFMark significantly outperforms state-of-the-art methods, maintaining high extraction accuracy even at high CRF (constant rate factor) values. Thanks to a multi-scale spatiotemporal perception loss, watermarked videos retain excellent visual fidelity and temporal coherence, and the framework supports HD resolution with 32-frame inputs. CSTFMark offers a practical, scalable solution for robust video watermarking in real encoding pipelines.
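
The three-stage ordering described in the abstract (undistorted videos, then simulated compression, then real H.264 streams) can be illustrated with a minimal scheduling sketch. This is not the authors' implementation: the class name `CurriculumSchedule`, the stage labels, and the epoch boundaries are hypothetical, chosen only to show how a training loop might select the distortion applied at each epoch. The momentum-based gradient continuation used for the real-codec stage is not modeled here, since the abstract does not give its details.

```python
from dataclasses import dataclass

# Stage order follows the abstract; the label strings are illustrative.
STAGES = ("undistorted", "simulated_h264", "real_h264")


@dataclass(frozen=True)
class CurriculumSchedule:
    """Maps a training epoch to one of the three curriculum stages."""

    simulated_start: int  # hypothetical: first epoch using the differentiable simulator
    real_start: int       # hypothetical: first epoch using the real encoder

    def __post_init__(self):
        # The stages must occur in order and each must last at least one epoch.
        if not 0 < self.simulated_start < self.real_start:
            raise ValueError("expected 0 < simulated_start < real_start")

    def stage(self, epoch: int) -> str:
        if epoch < 0:
            raise ValueError("epoch must be non-negative")
        if epoch < self.simulated_start:
            return STAGES[0]
        if epoch < self.real_start:
            return STAGES[1]
        return STAGES[2]


# Example boundaries (assumed values, not taken from the paper).
schedule = CurriculumSchedule(simulated_start=10, real_start=30)
plan = [schedule.stage(e) for e in (0, 9, 10, 29, 30, 50)]
```

In a training loop, the returned label would decide whether a batch passes through no distortion, a differentiable compression simulator, or an actual encode/decode round trip before the extractor sees it.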
