March 3, 2026 | Open Access

Detecting inappropriate material used to train AI image generation models

Key Points

Appropriate images are necessary for training generative AI models to prevent harmful outputs.
Key challenges are that images should not be generated from a suspicious model during detection, and that prompts yielding inappropriate results may be obfuscated.
Proposed solutions include a multi-layered framework combining embedding analysis and trajectory classification.
Future work should test the effectiveness of the suggested detection strategies in controlled experiments.

Abstract

Generative AI models, such as diffusion models, have emerged as state-of-the-art methods for generating novel images from a text prompt. Open-source models can furthermore be fine-tuned to produce images similar to a given dataset. However, bad actors may fine-tune a model on illegal images so that it produces inappropriate and harmful images. We investigate methods for detecting whether such images have been used to fine-tune a given diffusion model. This task raises two key challenges: (1) images should not be produced from a suspicious model during detection, and (2) any prompts yielding inappropriate images may be obfuscated. To overcome these challenges, we propose a multi-layered framework that combines embedding analysis, trajectory classification, parameter inspection, and neural network encoding. We suggest that future work test this strategy in controlled experiments.
