What question did this study set out to answer?

This research aims to enhance blind face restoration by incorporating textual information to improve facial detail reconstruction.

February 20, 2026

A Natural Language Guided Approach for Blind Face Restoration: Methodology and Dataset

Key Points

Built a large-scale dataset of 30,000 detailed textual descriptions paired with CelebA-HQ face images.
Introduced FaceCLIP, a fine-tuned vision-language model that aligns face images with their textual descriptions.
Proposed Text-guided Blind Face Restoration (TBFR), a diffusion-based framework that integrates text guidance into the restoration process.
Implemented a text-aware loss to enforce semantic consistency between generated images and their textual descriptions.
TBFR outperforms state-of-the-art blind face restoration methods on quantitative metrics.
Subjective assessments indicate improved perceptual quality of reconstructed faces.
Demonstrated effective recovery of subtle facial features such as wrinkles, moles, and skin marks.
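
To make the image-text alignment idea concrete, here is a minimal sketch of a symmetric contrastive objective of the kind commonly used to train CLIP-style models. The summary does not state which objective FaceCLIP is fine-tuned with, so the loss form, the function names, and the temperature value of 0.07 are all assumptions for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_row(logits, target):
    # Negative log-softmax of the target entry, computed stably.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical CLIP-style loss: pair k of images and texts should match.

    image_embs and text_embs are equal-length lists of embedding vectors.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its own column.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its own row.
    loss_t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Correctly paired embeddings give a much lower loss than swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_alignment_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_alignment_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is near zero when each face embedding points the same way as its own description and large when the descriptions are swapped, which is the behavior an alignment model is trained toward.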

Abstract

Blind Face Restoration (BFR) aims to reconstruct high-quality face images from low-quality inputs without any prior knowledge of the specific degradation types or levels. In recent years, remarkable progress has been achieved, particularly through GAN- and diffusion-based approaches, which have greatly improved perceptual realism and reconstruction fidelity. However, existing approaches typically rely solely on visual cues from degraded images. This often results in inaccurate reconstruction of facial details and noticeable identity distortion, particularly under severe or complex degradations. To address these limitations, we incorporate auxiliary textual information into BFR to enable the recovery of subtle facial attributes, such as wrinkles, moles, and skin marks, that are often overlooked or hard to reconstruct with conventional visual priors. To support this idea, we first construct a large-scale dataset containing 30,000 detailed textual descriptions paired with CelebA-HQ face images, explicitly designed to capture fine-grained facial semantics. To bridge the gap between visual data and natural language, we further propose FaceCLIP, a fine-tuned vision-language model tailored to the human face. FaceCLIP enables more accurate alignment between face images and their corresponding textual descriptions by capturing nuanced semantic cues critical for faithful face reconstruction. Built upon these foundations, we propose Text-guided Blind Face Restoration (TBFR), a novel diffusion-based framework that explicitly integrates textual guidance into the face restoration pipeline. Within TBFR, a text-guided hybrid attention block is designed to fuse visual and textual features, while a text-aware loss is employed to enforce semantic consistency between the generated images and their associated textual descriptions. Extensive experimental results show that TBFR outperforms state-of-the-art BFR methods in terms of both quantitative metrics and subjective perceptual quality, establishing a new benchmark for BFR tasks.
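
The two TBFR components named in the abstract can be illustrated with a small sketch. The abstract does not describe the internals of the hybrid attention block or the exact form of the text-aware loss, so the code below assumes a plain cross-attention step (image tokens as queries, text tokens as keys and values, with a residual connection and no learned projections) and a cosine-distance loss between an image embedding and a text embedding. All function names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(image_tokens, text_tokens):
    """Assumed fusion step: every image token attends over all text tokens.

    Learned query/key/value projections are omitted to keep the sketch short;
    both token sets are assumed to share the same feature dimension.
    """
    d = len(text_tokens[0])
    fused = []
    for q in image_tokens:
        # Scaled dot-product score of this image token against each text token.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text_tokens
        ]
        weights = softmax(scores)
        # Attention-weighted average of the text tokens.
        context = [
            sum(weights[j] * text_tokens[j][c] for j in range(len(text_tokens)))
            for c in range(d)
        ]
        # Residual connection keeps the original visual feature.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

def text_aware_loss(image_emb, text_emb):
    """Assumed loss form: 1 - cosine similarity (0 when perfectly aligned)."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    return 1.0 - dot / (norm_i * norm_t)

# One image token fused with a single text token.
fused = cross_attention([[1.0, 2.0]], [[0.5, 0.5]])
aligned = text_aware_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = text_aware_loss([1.0, 0.0], [0.0, 1.0])
```

In a real pipeline the embeddings passed to the loss would come from a vision-language encoder such as FaceCLIP, and the attention step would sit inside the diffusion network; both are stand-ins here.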
