In the field of intelligent inspection, high-definition video data collected by quadruped robot dogs face severe transmission and storage constraints. Although existing advanced lossy video coding standards can significantly improve compression efficiency, they inevitably introduce severe compression artifacts in low-bit-rate scenarios. To address this issue, this paper proposes a decoded-video quality enhancement network, the Video Quality Restoration Network (VQRNet), based on a dual-stream architecture. Specifically, the Local Feature Extraction component incorporates a Progressive Feature Fusion Module (PFFM) with a four-stage progressive structure. By integrating reparameterized convolution and attention mechanisms, PFFM focuses on capturing high-frequency texture details to repair small-scale distortions. In the same component, the Multi-Scale Lightweight Spatial Attention Module (MLSA) performs spatial feature recalibration, using multi-scale convolution to adaptively identify and enhance key spatial regions and thereby address multi-scale distortion. In the Global Feature Extraction component, the State-Space Attention Module (SSAM) combines State-Space Models (SSMs) with attention mechanisms to capture long-range dependencies and contextual information, targeting the large-scale distortions caused by high-intensity compression. To verify the performance of the proposed algorithm, a dedicated dataset comprising 20 real-world video sequences captured by quadruped robot dogs (partitioned into 15 training and 5 testing sequences) was constructed, and the VTM 23.4 reference software was employed to simulate compression degradation at four quantization parameters (QP 30, 35, 40, and 45). Experimental results demonstrate that VQRNet outperforms state-of-the-art quality enhancement methods, specifically MIRNet, NAFNet, TRRHA, and CTNet, on core metrics including PSNR and SSIM.
In the QP = 30 scenario, VQRNet achieves an average PSNR of 40.33 dB, an improvement of 3.32 dB over the VTM 23.4 baseline (37.01 dB). It is also efficient in computational complexity and parameter count, requiring only 5.27 G FLOPs and 1.40 M parameters, with an average inference latency of only 11.82 ms per 128 × 128 patch. This work provides robust technical support for the efficient video perception of quadruped robot dogs.
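For readers who want to reproduce the PSNR comparison, the following is a minimal sketch of how a PSNR gain over a decoded baseline is typically computed. It is not the authors' evaluation code: it assumes 8-bit frames (peak value 255) stored as NumPy arrays, and the function names `psnr` and `psnr_gain` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped frames.

    Assumes 8-bit content by default (peak=255); this is an assumption,
    not a detail stated in the text above.
    """
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    if ref.shape != tst.shape:
        raise ValueError("frames must have the same shape")
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)


def psnr_gain(reference, decoded, enhanced, peak=255.0):
    """PSNR improvement (dB) of the enhanced frame over the decoded frame."""
    return psnr(reference, enhanced, peak) - psnr(reference, decoded, peak)


# Gain implied by the averages reported in the text for QP = 30.
baseline_db = 37.01   # VTM 23.4 decoded output
enhanced_db = 40.33   # VQRNet output
reported_gain_db = round(enhanced_db - baseline_db, 2)
```

In practice the per-frame values would be averaged over all frames of the test sequences; whether the metric is computed on the luma channel only or on all channels is not specified above and would need to match the evaluation protocol being compared against.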
Feng et al. (Tue,) studied this question.