What question did this study set out to answer?

This research aims to establish a standardized evaluation framework for assessing generative face-editing methods, focusing on image quality, identity preservation, and attribute disentanglement.

May 22, 2026 · Open Access

An evaluation framework for generative face-editing methods: Quality, identity and disentanglement

Key Points

The framework integrates full-reference (SSIM, LPIPS, FID) and no-reference (DiffQA(R)-AI-KD) image quality measures.
Three generative models (StarGAN, VecGAN, DiffAE) were evaluated across three datasets.
Over 29,000 generated images were analyzed using metrics for identity preservation and attribute entanglement.
The framework quantifies identity preservation and attribute modification, revealing that stronger edits often increase attribute entanglement and reduce identity preservation.
Biased training data, such as the CelebA dataset with its demographic skew, affects the performance of face-editing models.
The evaluation metrics point to key areas for improvement in future generative face-editing models.

Abstract

With the advent of deep generative models, there has been recent interest in the manipulation of people’s facial features. This has many potential applications in fashion and biometrics. However, it is a complex task: modifying a given attribute should not affect the others, identity should be preserved, and image quality should not be degraded. So far, the evaluation of the proposed methods has been mostly qualitative, which is insufficient to demonstrate progress and performance. We propose a comprehensive evaluation framework to estimate the quality of facial attribute editing methods with respect to several criteria: image quality, effective modification of the targeted attribute, level of entanglement between attributes, and identity preservation. Three generative models are used to demonstrate the proposed evaluation framework over three datasets and three editing methods, resulting in the analysis of over 29k generated images.

• We propose a standardized evaluation framework for face-editing models, addressing image quality, identity preservation, and attribute disentanglement. The framework integrates full-reference (SSIM, LPIPS, FID) and no-reference (DiffQA(R)-AI-KD) evaluation methods, along with identity preservation and attribute entanglement analysis.
• We introduce metrics to quantify identity preservation, facial landmark deformation, and entanglement between attributes, enabling a comprehensive assessment of generative face-editing models.
• We apply the proposed framework to three models: StarGAN, VecGAN, and DiffAE. The evaluation reveals that stronger attribute edits often increase entanglement and reduce identity preservation, highlighting key areas for improvement in future models.
• Our experiments highlight the impact of biased training data on attribute entanglement. For instance, CelebA, composed of celebrity faces, exhibits demographic skew, with age and gender biases. These biases affect face-editing models, limiting accuracy and generalization across underrepresented groups.
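The abstract names identity preservation and attribute entanglement as criteria but does not give formulas for them here. As a rough illustration only, and not the authors' definitions, the sketch below assumes that identity is compared through the cosine similarity of face-embedding vectors (from some face-recognition model) and that entanglement is measured as the mean absolute change in non-target attribute scores (from some attribute classifier). The function names `identity_similarity` and `entanglement_score` are hypothetical, as are the toy inputs.

```python
import math
from typing import Dict, Sequence


def identity_similarity(emb_original: Sequence[float],
                        emb_edited: Sequence[float]) -> float:
    """Cosine similarity between two face embeddings (assumed metric).

    Values near 1.0 suggest the edit kept the identity; lower values
    suggest identity drift.
    """
    dot = sum(x * y for x, y in zip(emb_original, emb_edited))
    norm_o = math.sqrt(sum(x * x for x in emb_original))
    norm_e = math.sqrt(sum(y * y for y in emb_edited))
    if norm_o == 0.0 or norm_e == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_o * norm_e)


def entanglement_score(scores_before: Dict[str, float],
                       scores_after: Dict[str, float],
                       target: str) -> float:
    """Mean absolute change of every attribute except the edited one.

    Assumed metric: 0.0 means the edit left all other attributes
    untouched; larger values mean more unintended side effects.
    """
    others = [name for name in scores_before if name != target]
    if not others:
        return 0.0
    total = sum(abs(scores_after[name] - scores_before[name]) for name in others)
    return total / len(others)


if __name__ == "__main__":
    # Toy values, not taken from the study.
    before = {"smile": 0.1, "age": 0.5, "gender": 0.2}
    after = {"smile": 0.9, "age": 0.7, "gender": 0.2}
    print(identity_similarity([0.6, 0.8], [0.8, 0.6]))
    print(entanglement_score(before, after, target="smile"))
```

Under these assumptions, a stronger edit would show up as a lower identity similarity and a higher entanglement score, which matches the trade-off the abstract describes.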
