Abstract
Robustness testing for large-language-model-based recommender systems (LLM4Rec) typically relies on a handful of handwritten attack prompts drawn from one injection pattern at a single perturbation rate. These narrow test suites miss most of the attack surface and paint an overly optimistic picture of system security. We introduce an automated red-teaming framework that replaces static templates with an adaptive loop. An attacker model generates diverse adversarial prompts across sixteen attack categories; a judge then scores each prompt using ranking distortion metrics. Successful attacks are fed back into the defense for iterative hardening. We run the framework against RoLLMRec on MovieLens, Amazon Books, and Yelp. The loop exposes attack success rates of 42.4%, 48.1%, and 52.0% on the three benchmarks, well above those of the static, paraphrase, and PAIR-style baselines. After the hardening cycle, vulnerability drops by 56.4%, 74.3%, and 68.8%, respectively, while false positive rates stay at or below 0.6%. Module ablations indicate that the four defense components combine super-additively, with the prompt shield contributing the largest single share. The full attack corpus is publicly available to support reproducible adversarial benchmarking in LLM4Rec.
Shehmir et al. (Wed,) studied this question.