Video Quality Assessment (VQA) strives to computationally emulate human perceptual judgments and has garnered significant attention given its widespread applicability. However, existing methodologies face two primary impediments: (1) limited proficiency in evaluating samples at quality extremes (e.g., severely degraded or near-perfect videos), and (2) insufficient sensitivity to nuanced quality variations arising from a misalignment with human perceptual mechanisms. Although vision-language models offer promising semantic understanding, their reliance on visual encoders pre-trained for high-level tasks often compromises their sensitivity to low-level distortions. To surmount these challenges, we propose the Restoration-Assisted Multi-modality VQA (RAM-VQA) framework. Uniquely, our approach leverages video restoration as a proxy to explicitly model distortion-sensitive features. The framework operates through two synergistic stages: a prompt learning stage that constructs a quality-aware textual space using triple-level references (degraded, restored, and pristine) derived from the restoration process, and a dual-branch evaluation stage that integrates semantic cues with technical quality indicators via spatio-temporal differential analysis. Extensive experiments demonstrate that RAM-VQA achieves state-of-the-art performance across diverse benchmarks, exhibiting superior capability in handling extreme-quality content while ensuring robust generalization.
Chen et al. (Thu,) studied this question.