What question did this study set out to answer?

The study aims to enhance the quality of decoded video from quadruped robot dogs by reducing the compression artifacts that appear in low-bit-rate scenarios.

March 13, 2026 | Open Access

Lightweight State-Space Model-Based Video Quality Enhancement for Quadruped Robot Dog Decoded Streams

Key Points

Aimed to enhance the quality of decoded video from quadruped robot dogs by reducing compression artifacts in low-bit-rate scenarios.
Developed a video quality enhancement network named VQRNet using a dual-stream architecture.
Employed a Progressive Feature Fusion Module (PFFM) to capture high-frequency texture details.
Utilized a Multi-Scale Lightweight Spatial Attention Module (MLSA) for identifying and enhancing spatial features.
Constructed a dedicated dataset with 20 real-world video sequences for training and testing.
VQRNet achieved an average PSNR of 40.33 dB in the QP = 30 scenario, outperforming the VTM 23.4 baseline (37.01 dB) by 3.32 dB.
Demonstrated low computational complexity, requiring only 5.27 G FLOPs and 1.40 M parameters.
Achieved an average inference latency of 11.82 ms per 128 × 128 patch.
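
To put the per-patch latency in context, the following is a minimal sketch of how many 128 × 128 patches tile a frame and what the total time would be if patches were processed strictly one after another. The 1920 × 1080 frame size, the function names, and the sequential-processing assumption are illustrative only; the summary does not state the frame resolution or whether patches are batched.

```python
import math

def patch_count(width, height, patch=128):
    """Number of patch x patch tiles needed to cover a width x height frame.

    Partial tiles at the right and bottom edges count as full tiles.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)

def sequential_latency_ms(width, height, per_patch_ms=11.82, patch=128):
    """Total time if every tile is enhanced one at a time (an assumption)."""
    return patch_count(width, height, patch) * per_patch_ms

# Hypothetical frame size, chosen only for illustration.
tiles = patch_count(1920, 1080)
total_ms = sequential_latency_ms(1920, 1080)
print(f"{tiles} patches, about {total_ms:.1f} ms per frame if run sequentially")
```

Batched or parallel inference would change this figure, so it is best read as a rough upper-bound estimate under the stated assumptions rather than a measured result.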

Abstract

In the field of intelligent inspection, high-definition video data collected by quadruped robot dogs face severe transmission and storage constraints. Although advanced lossy video coding standards can significantly improve compression efficiency, they inevitably introduce severe compression artifacts in low-bit-rate scenarios. To address this issue, this paper proposes a decoded-video quality enhancement network named Video Quality Restoration Network (VQRNet), based on a dual-stream architecture. Specifically, the Local Feature Extraction component incorporates a Progressive Feature Fusion Module (PFFM) with a four-stage progressive structure. By integrating reparameterized convolution and attention mechanisms, PFFM focuses on capturing high-frequency texture details to repair small-scale distortions. Simultaneously, the Multi-Scale Lightweight Spatial Attention Module (MLSA) performs spatial feature recalibration, using multi-scale convolution to adaptively identify and enhance key spatial regions and thereby address multi-scale distortion. In the Global Feature Extraction component, the State-Space Attention Module (SSAM) combines State-Space Models (SSMs) with attention mechanisms to capture long-range dependencies and contextual information, targeting the large-scale distortions caused by high-intensity compression. To verify the performance of the proposed algorithm, a dedicated dataset of 20 real-world video sequences captured by quadruped robot dogs (partitioned into 15 training and 5 testing sequences) was constructed, and the VTM 23.4 reference software was used to simulate compression degradation at four quantization parameters (QP 30, 35, 40, and 45). Experimental results demonstrate that VQRNet outperforms state-of-the-art quality enhancement methods, specifically MIRNet, NAFNet, TRRHA, and CTNet, on core metrics including PSNR and SSIM. In the QP = 30 scenario, VQRNet achieves an average PSNR of 40.33 dB, an improvement of 3.32 dB over the VTM 23.4 baseline (37.01 dB), while also offering advantages in computational complexity and parameter efficiency: it requires only 5.27 G FLOPs and 1.40 M parameters, with an average inference latency of 11.82 ms per 128 × 128 patch. This work provides robust technical support for the efficient video perception of quadruped robot dogs.
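
For readers less familiar with the headline metric, here is a minimal sketch of how PSNR is conventionally computed and what a dB gain means in terms of mean squared error. It assumes 8-bit pixel values (peak 255) passed as flat sequences; the function names and sample pixel values are made up for illustration and are not taken from the paper's evaluation code.

```python
import math

def psnr(reference, decoded, peak=255.0):
    """PSNR in dB between two equal-length flat sequences of pixel values."""
    if len(reference) != len(decoded):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error between the reference and the decoded signal.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_reduction_factor(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain on a single image."""
    return 10.0 ** (gain_db / 10.0)

# Toy pixel values, purely illustrative.
reference = [52, 55, 61, 59, 79, 61, 76, 61]
decoded = [54, 55, 60, 59, 77, 63, 76, 60]
print(f"PSNR: {psnr(reference, decoded):.2f} dB")

# Gain quoted in the abstract: 40.33 dB versus the 37.01 dB baseline.
print(f"MSE reduction factor: {mse_reduction_factor(40.33 - 37.01):.2f}")
```

Because the reported figures are averages over sequences, the reduction factor is only an approximate reading: a 3.32 dB gain corresponds to roughly halving the mean squared error on any single frame where that gain holds.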
