This preprint documents a targeted layer-retention method for mixed-precision NF4 quantization of diffusion-transformer image models. The method retains architecture-specific first, last, and boundary modules in bfloat16 while quantizing the middle transformer body to 4-bit NormalFloat (NF4) with double quantization. The central claim is that useful compression depends not only on using NF4, but on selecting which layers should not be quantized. Public Hugging Face artifacts demonstrate this rule across text-conditioned image editing with Qwen-Image-Edit variants and text-to-image generation with ERNIE-Image. The first related public artifact, ovedrive/qwen-image-edit-4bit, was released on Hugging Face on 19 August 2025. This Zenodo record archives the manuscript describing the method, artifact history, deployment observations, limitations, and future benchmark directions.
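The retention rule described above can be sketched as a small helper that decides which modules stay in bfloat16 and which fall into the NF4-quantized middle body. This is only an illustrative sketch, not the manuscript's implementation: the function names, the `transformer_blocks` prefix, the boundary module names `input_proj` and `output_proj`, and the choice of keeping one block at each end are all hypothetical placeholders, since the actual retained modules are architecture-specific.

```python
from typing import List, Sequence


def select_retained_modules(
    num_blocks: int,
    keep_first: int = 1,
    keep_last: int = 1,
    block_prefix: str = "transformer_blocks",  # hypothetical module prefix
    boundary_modules: Sequence[str] = ("input_proj", "output_proj"),  # hypothetical names
) -> List[str]:
    """Return module-name prefixes to keep in bfloat16 (i.e. not quantized)."""
    # Indices of the first `keep_first` and last `keep_last` transformer blocks.
    first = set(range(min(keep_first, num_blocks)))
    last = set(range(max(num_blocks - keep_last, 0), num_blocks))
    retained_blocks = [f"{block_prefix}.{i}" for i in sorted(first | last)]
    # Boundary modules (e.g. input/output projections) are retained as well.
    return list(boundary_modules) + retained_blocks


def is_quantized(module_name: str, retained: Sequence[str]) -> bool:
    """True if `module_name` falls outside every retained prefix."""
    for prefix in retained:
        # Match the module itself or any submodule below it, so that
        # "transformer_blocks.5" does not accidentally match "transformer_blocks.50".
        if module_name == prefix or module_name.startswith(prefix + "."):
            return False
    return True


retained = select_retained_modules(num_blocks=6)
print(retained)
print(is_quantized("transformer_blocks.3.attn.to_q", retained))
print(is_quantized("transformer_blocks.0.attn.to_q", retained))
```

A list built this way could, for instance, be handed to a skip-modules option in a 4-bit quantization config (the bitsandbytes configs in `transformers` and `diffusers` expose `llm_int8_skip_modules` alongside `bnb_4bit_quant_type="nf4"` and `bnb_4bit_use_double_quant=True`); whether the published artifacts were produced this way is an assumption here.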
Abhishek Dujari studied this question.