Blind image quality assessment (BIQA), which operates without reference images, remains challenging because perceptual quality is largely determined by subtle, spatially localized distortions. Existing methods based on Contrastive Language–Image Pre-training (CLIP) are optimized for global semantic alignment and therefore show limited sensitivity to fine-grained degradations such as local blur, noise, compression artifacts, and exposure inconsistencies. To overcome this limitation, we propose a fine-grained vision–language framework that strengthens distortion-aware representation in both the visual and the textual domain. Specifically, our method pairs a fine-grained CLIP architecture with explicit, detailed textual descriptions to identify subtle regional degradations. In addition, a parameter-efficient prompt-tuning strategy learns task-adaptive prompt representations tailored to image quality assessment (IQA). Extensive experiments on three widely used in-the-wild IQA benchmarks show that the proposed method achieves strong consistency with human subjective judgments: our training-free FGCLIP-IQA reaches a maximum Spearman rank-order correlation coefficient (SROCC) of 0.732 on KonIQ-10k, outperforming the vanilla CLIP-IQA baseline, while the prompt-tuned FGCLIP-IQA+ reaches a maximum SROCC of 0.909 on KonIQ-10k with only a small number of learnable parameters and generalizes robustly across datasets. These results indicate that fine-grained vision–language alignment offers an efficient and accurate solution for the BIQA task and is a promising direction for future work.
Tan et al. studied this question.
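To make the evaluation protocol concrete, the following minimal Python sketch shows how a training-free, prompt-pair quality score might be formed and how SROCC against human ratings is computed. It is not the authors' implementation. The function names, the antonym-prompt softmax, the `logit_scale` value of 100, and all toy numbers are assumptions made for illustration; the abstract does not specify how FGCLIP-IQA turns similarities into a score.

```python
import math


def prompt_pair_score(sim_pos, sim_neg, logit_scale=100.0):
    """Softmax probability of the positive prompt (assumed scoring rule).

    sim_pos / sim_neg are cosine similarities between an image embedding and
    a "good quality" / "bad quality" text embedding. logit_scale is an
    assumed temperature, not a value taken from the paper.
    """
    a = logit_scale * sim_pos
    b = logit_scale * sim_neg
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srocc(predicted, target):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    n = len(predicted)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(predicted), average_ranks(target)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("SROCC is undefined for constant input")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Hypothetical (sim_pos, sim_neg) pairs and mean opinion scores.
    toy_sims = [(0.27, 0.24), (0.22, 0.25), (0.26, 0.25), (0.21, 0.26)]
    toy_mos = [4.1, 2.3, 3.8, 1.9]
    scores = [prompt_pair_score(p, n) for p, n in toy_sims]
    print("SROCC on toy data:", srocc(scores, toy_mos))
```

Because SROCC depends only on rank order, any monotonic rescaling of the predicted scores leaves it unchanged, which is why it is a common choice for comparing model outputs with subjective ratings on different scales.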