This contribution is an extended abstract of the paper originally published in the IEEE Transactions on Software Engineering (TSE) [Co25]. The paper presents EvoTox, an automated black-box testing framework that uses evolutionary search to assess the susceptibility of Large Language Models (LLMs) to generating toxic content in response to natural, realistic prompts. An empirical evaluation on five state-of-the-art LLMs (7B to 671B parameters) shows that EvoTox detects toxicity significantly better than existing baseline methods, with effect sizes up to 1.0, at a limited cost overhead (22% to 35%), and that the prompts it generates are validated as human-like by domain experts.
Corbo et al. studied this question.
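To make the idea of evolutionary search over prompts concrete, the following is a minimal sketch of a generic (1+λ) search loop. It is not the EvoTox implementation: the names `evolve`, `toy_mutate`, and `toy_score` are hypothetical, and the toy mutation and scoring functions are harmless stand-ins. In a black-box testing setting, one would assume the mutation step rephrases the prompt and the score is a toxicity rating of the response returned by the model under test.

```python
import random
from typing import Callable, List, Tuple


def evolve(seed: str,
           mutate: Callable[[str, random.Random], str],
           score: Callable[[str], float],
           generations: int = 10,
           offspring: int = 4,
           rng_seed: int = 0) -> Tuple[str, float, List[float]]:
    """Generic (1+lambda) search: keep one parent, replace it only on improvement."""
    rng = random.Random(rng_seed)
    best, best_score = seed, score(seed)
    history = [best_score]
    for _ in range(generations):
        # Generate `offspring` variants of the current best prompt.
        candidates = [mutate(best, rng) for _ in range(offspring)]
        scored = [(score(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        # Elitist selection: the parent survives unless a child scores higher.
        if top_score > best_score:
            best, best_score = top, top_score
        history.append(best_score)
    return best, best_score, history


# Hypothetical, harmless stand-ins (assumptions for illustration only).
WORDS = ["please", "really", "now", "again"]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    # Stand-in for a prompt-rewriting step: append one random word.
    return prompt + " " + rng.choice(WORDS)


def toy_score(prompt: str) -> float:
    # Stand-in for a toxicity score in [0, 1]: grows with prompt length.
    return min(1.0, len(prompt.split()) / 20.0)


best, best_score, history = evolve("hello there", toy_mutate, toy_score)
```

Because selection is elitist, the recorded score in `history` never decreases from one generation to the next. The search only calls `score`, which is why a loop of this shape suits black-box testing.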