March 27, 2026 | Open Access

Fine-Grained Vision-Language Method with Prompt Tuning for Blind Image Quality Assessment

Key Points

Aimed to improve blind image quality assessment by addressing the tendency of existing methods to overlook fine-grained distortions.
Developed a vision-language framework built on a fine-grained CLIP architecture.
Incorporated explicit textual descriptions to enhance distortion identification.
Employed a parameter-efficient prompt-tuning strategy to learn prompts adapted to image quality assessment.
The training-free FGCLIP-IQA achieved a maximum SROCC of 0.732 on the KonIQ-10k benchmark.
The prompt-tuned FGCLIP-IQA+ reached a maximum SROCC of 0.909 on the same benchmark.
Demonstrated strong consistency with human subjective judgments and robust generalization across datasets.
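
The SROCC figures above are Spearman rank-order correlation coefficients, which measure how well the ordering of predicted quality scores matches the ordering of human opinion scores. The following is a minimal sketch of that computation in plain Python; the function names and the five sample scores are illustrative assumptions, not code or data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srocc(predicted, mos):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rp, rm = average_ranks(predicted), average_ranks(mos)
    mp, mm = mean(rp), mean(rm)
    cov = sum((a - mp) * (b - mm) for a, b in zip(rp, rm))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_m = sum((b - mm) ** 2 for b in rm)
    return cov / (var_p * var_m) ** 0.5


# Hypothetical model scores and mean opinion scores for five images.
predicted = [0.81, 0.40, 0.65, 0.72, 0.15]
mos = [4.2, 3.1, 2.9, 3.8, 1.5]
print(round(srocc(predicted, mos), 3))  # 0.9 (one adjacent pair is swapped)
```

Because only ranks enter the calculation, SROCC rewards a model for ordering images correctly even when its raw scores sit on a different scale from the human ratings.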

Abstract

Blind image quality assessment (BIQA), which operates without reference images, remains challenging because perceptual quality is largely determined by subtle, spatially localized distortions. Existing Contrastive Language–Image Pre-training (CLIP)-based methods are optimized for global semantic alignment and therefore exhibit limited sensitivity to fine-grained degradations such as local blur, noise, compression artifacts, and exposure inconsistencies. To overcome these limitations, we propose a fine-grained vision–language framework that enhances distortion-aware representation in both the visual and textual domains. Specifically, our method pairs a fine-grained CLIP architecture with explicit textual descriptions to identify subtle regional degradations. Furthermore, a parameter-efficient prompt-tuning strategy learns task-adaptive prompt representations tailored to image quality assessment (IQA). Extensive experiments on three widely used in-the-wild IQA benchmarks show that the proposed method achieves strong consistency with human subjective judgments: our training-free FGCLIP-IQA reaches a maximum Spearman rank-order correlation coefficient (SROCC) of 0.732 on KonIQ-10k, outperforming the vanilla CLIP-IQA baseline, while the prompt-tuned FGCLIP-IQA+ achieves a maximum SROCC of 0.909 on KonIQ-10k with only a small number of learnable parameters and exhibits robust cross-dataset generalization. These results indicate that fine-grained vision–language alignment is a promising direction and provides an efficient and accurate solution for the BIQA task.
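
The abstract does not spell out how a training-free quality score is read from a vision-language model. One common recipe in CLIP-based IQA, assumed here purely for illustration and not confirmed as the authors' procedure, compares an image embedding against a pair of antonym text prompts (for example "Good photo." and "Bad photo.") and applies a softmax to the two cosine similarities. The sketch below uses toy three-dimensional vectors in place of real encoder outputs; `quality_score`, the embeddings, and the temperature of 0.01 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def quality_score(image_emb, good_emb, bad_emb, temperature=0.01):
    """Softmax over the two image-text similarities; returns P("good").

    The temperature is an assumed value, not one taken from the study.
    """
    s_good = cosine(image_emb, good_emb) / temperature
    s_bad = cosine(image_emb, bad_emb) / temperature
    m = max(s_good, s_bad)  # subtract the max for numerical stability
    e_good, e_bad = math.exp(s_good - m), math.exp(s_bad - m)
    return e_good / (e_good + e_bad)


# Toy stand-ins for encoder outputs (hypothetical values).
image_emb = [0.9, 0.1, 0.2]
good_emb = [1.0, 0.0, 0.0]   # embedding of a "good quality" prompt
bad_emb = [0.0, 1.0, 0.0]    # embedding of a "bad quality" prompt

score = quality_score(image_emb, good_emb, bad_emb)
print(score > 0.5)  # True: the image sits closer to the "good" prompt
```

Under this assumed recipe, prompt tuning would replace the fixed prompt embeddings with a small set of learnable vectors while the encoders stay frozen, which is consistent with the small number of learnable parameters reported for FGCLIP-IQA+.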
