What question did this study set out to answer?

To develop a method that preserves artistic intent during style transfer across diverse media using learnable dynamic style embeddings.

April 3, 2026 · Open Access

Cross-media style transfer in art: preserving artistic intent in diverse media using GANs

Key Points

Aimed to preserve artistic intent during style transfer across diverse media using learnable dynamic style embeddings.
Introduced one learnable dynamic token for each of 27 distinct WikiArt styles in Stable Diffusion.
Implemented an end-to-end training pipeline combining BLIP-generated captions and a multi-objective loss function.
Utilized style loss, perceptual loss, and blend loss to achieve coherent style representation.
Achieved an FID of 176.59 (vs. 183.23 for the untuned Stable Diffusion baseline), indicating improved distributional alignment.
Secured a CLIP score of 35.48 for image-text-style coherence and 89.91% accuracy in zero-shot style classification.
Demonstrated robust semantic adherence and style fidelity in qualitative assessments.
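
The loss terms listed above can be illustrated with a minimal sketch. It assumes feature maps are already extracted and flattened to a channels-by-positions matrix; the study uses VGG features, which are not computed here. The blend loss is not specified in this summary, so it is passed in as a plain number. The function names and the default weights are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# Assumption: a feature map flattened to shape (channels, positions).
Matrix = List[List[float]]


def gram_matrix(features: Matrix) -> Matrix:
    """Channel-by-channel inner products, normalised by the number of entries."""
    channels = len(features)
    positions = len(features[0])
    norm = channels * positions
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j])) / norm
            for j in range(channels)
        ]
        for i in range(channels)
    ]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error over all entries of two equally shaped matrices."""
    diffs = [
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def style_loss(generated: Matrix, reference: Matrix) -> float:
    """Compare Gram matrices: sensitive to texture statistics, not to layout."""
    return mse(gram_matrix(generated), gram_matrix(reference))


def perceptual_loss(generated: Matrix, reference: Matrix) -> float:
    """Compare features position by position: sensitive to content and layout."""
    return mse(generated, reference)


def total_loss(
    style: float,
    perceptual: float,
    blend: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Weighted sum of the three objectives."""
    w_style, w_perceptual, w_blend = weights
    return w_style * style + w_perceptual * perceptual + w_blend * blend


# Swapping spatial positions leaves the Gram matrix unchanged,
# so the style loss is zero while the perceptual loss is not.
original = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]
print(style_loss(original, shuffled), perceptual_loss(original, shuffled))
```

The closing example shows why the two terms complement each other: the Gram-based term ignores where a pattern occurs, while the feature-wise term penalises changes in layout.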

Abstract

While Stable Diffusion and other diffusion-based text-to-image models excel at semantic synthesis, they fail to faithfully represent the rich artistic styles of fine art paintings without heavy fine-tuning, reference images, or additional architectural components. We introduce a simple method: learnable dynamic style embeddings, i.e., one dynamic token per style embedded directly in the Stable Diffusion conditioning stream. The style embedding provides accurate, reference-free control over 27 distinct WikiArt styles (e.g., Impressionism, Cubism, Pop Art), allows gradual style blending through linear interpolation of the embeddings, and does not require token separation or multi-stage networks as in previous techniques such as StyleForge or Style2Talker. Our end-to-end training pipeline combines BLIP-generated captions with a weighted multi-objective loss: (1) a style loss (Gram-matrix based, ensuring texture and pattern fidelity), (2) a perceptual loss (VGG features, ensuring content and structure preservation), and (3) a blend loss (smoothing multi-style transitions). Trained on a downsampled WikiArt subset (~8,000 images), the proposed method outperforms the untuned Stable Diffusion baseline, obtaining an FID of 176.59 (vs. 183.23), which indicates better distributional alignment in the challenging domain of art, a CLIP score of 35.48 for image-text-style coherence, and 89.91% zero-shot style classification accuracy with CLIP. Qualitative results further demonstrate robust semantic adherence and style fidelity. This resource-efficient architecture, which needs no additional modules, opens a new avenue toward accessible artistic style transfer and generation, with potential real-world impact in creative AI.
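
The style blending described in the abstract, linear interpolation between per-style embeddings, can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the embeddings are toy three-dimensional vectors, whereas real ones would match the text encoder's token dimension, and the function name `blend_styles` and the parameter `alpha` are hypothetical.

```python
from typing import Dict, List


def blend_styles(
    embeddings: Dict[str, List[float]],
    style_a: str,
    style_b: str,
    alpha: float,
) -> List[float]:
    """Linearly interpolate two style embeddings.

    alpha = 0.0 returns style_a unchanged, alpha = 1.0 returns style_b,
    and values in between give a gradual mix of the two.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a = embeddings[style_a]
    b = embeddings[style_b]
    if len(a) != len(b):
        raise ValueError("style embeddings must have the same dimension")
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


# Toy embeddings for illustration only; learned values would come from training.
style_embeddings = {
    "Impressionism": [1.0, 0.0, 2.0],
    "Cubism": [0.0, 1.0, 4.0],
}

mostly_impressionist = blend_styles(style_embeddings, "Impressionism", "Cubism", 0.25)
print(mostly_impressionist)
```

In a full pipeline the blended vector would take the place of the single style token in the conditioning input, so sweeping `alpha` from 0 to 1 would move the output gradually from one style to the other.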
