While Stable Diffusion and other diffusion-based text-to-image models excel at semantic synthesis, they fail to faithfully reproduce the rich artistic styles of fine art paintings without heavy fine-tuning, reference images, or additional architectural components. We introduce a simple method: learnable dynamic style embeddings, i.e., one dynamic token per style embedded directly in the Stable Diffusion conditioning stream. The style embedding provides accurate, reference-free control over 27 distinct WikiArt styles (e.g., Impressionism, Cubism, Pop Art), allows gradual style blending through linear interpolation of the embeddings, and does not require token separation or multi-stage networks as in previous techniques such as StyleForge or Style2Talker. Our end-to-end training pipeline combines BLIP-generated captions with a weighted multi-objective loss: (1) a style loss (Gram-matrix based, ensuring texture and pattern fidelity), (2) a perceptual loss (VGG features, ensuring content and structure preservation), and (3) a blend loss (smoothing multi-style transitions). Trained on a downsampled WikiArt subset (~8,000 images), the proposed method outperforms the untuned Stable Diffusion baseline, obtaining an FID of 176.59 (vs. 183.23), which indicates better distributional alignment in the challenging domain of art, a CLIP score of 35.48 for image-text-style coherence, and 89.91% zero-shot style classification accuracy with CLIP. Qualitative results further demonstrate that semantic adherence and style fidelity are achieved robustly. This resource-efficient, module-free architecture opens a new avenue to accessible artistic style transfer and generation, with potential real-world impact in creative AI.
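As an illustration of two ideas named in the abstract, linear interpolation of style embeddings and a weighted sum of style, perceptual, and blend losses, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`blend_styles`, `gram_matrix`, `style_loss`, `total_loss`), the Gram-matrix normalisation, and the default loss weights are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def blend_styles(e_a: Sequence[float], e_b: Sequence[float], alpha: float) -> Vector:
    """Linearly interpolate two style embeddings: (1 - alpha) * e_a + alpha * e_b."""
    if len(e_a) != len(e_b):
        raise ValueError("style embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(e_a, e_b)]


def gram_matrix(features: Sequence[Sequence[float]]) -> Matrix:
    """Gram matrix of a (channels x positions) feature map.

    Dividing by the number of positions is an assumed normalisation.
    """
    n = len(features[0])
    return [
        [sum(x * y for x, y in zip(f_i, f_j)) / n for f_j in features]
        for f_i in features
    ]


def style_loss(feat_gen: Sequence[Sequence[float]],
               feat_ref: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between the Gram matrices of two feature maps."""
    g, r = gram_matrix(feat_gen), gram_matrix(feat_ref)
    c = len(g)
    return sum((g[i][j] - r[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


def total_loss(l_style: float, l_perc: float, l_blend: float,
               w_style: float = 1.0, w_perc: float = 1.0, w_blend: float = 1.0) -> float:
    """Weighted multi-objective loss; the weights here are hypothetical placeholders."""
    return w_style * l_style + w_perc * l_perc + w_blend * l_blend


# Usage: a 50/50 blend of two toy 2-dimensional "style tokens".
mixed = blend_styles([0.0, 2.0], [4.0, 0.0], alpha=0.5)
```

In a real pipeline the embeddings would be learnable tensors in the text-conditioning space and the feature maps would come from a pretrained VGG network; plain lists are used here only to keep the sketch self-contained.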
Xi Cao studied this question.