What question did this study set out to answer?

This research aims to improve robustness testing methods for large language model-based recommender systems by implementing automated adversarial red teaming.

June 5, 2026 | Open Access

Automated adversarial red-teaming for evaluating robustness in LLM-based recommender systems

Key Points

An adaptive loop replaces static templates: an attacker model generates diverse adversarial prompts across sixteen attack categories.
A judge scores each attack prompt using ranking distortion metrics.
The framework was tested against RoLLMRec on three datasets: MovieLens, Amazon Books, and Yelp.
Attack success rates reached 42.4%, 48.1%, and 52.0% on MovieLens, Amazon Books, and Yelp, respectively.
After the hardening cycle, vulnerability dropped by 56.4%, 74.3%, and 68.8%, while false positive rates stayed at or below 0.6%.
Module ablations indicate that the four defense components combine super-additively, with the prompt shield contributing the largest single share.
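
One way to read "super-additive" is that the protection gained with all components enabled exceeds the sum of the gains each component achieves alone. The sketch below only illustrates that definition. The function name, the component names other than the prompt shield, and every number are placeholders assumed for illustration. They are not results or terminology from the paper, whose ablation protocol may differ.

```python
from typing import Mapping


def is_super_additive(solo_gains: Mapping[str, float], combined_gain: float) -> bool:
    """Return True when the combined gain exceeds the sum of the solo gains."""
    return combined_gain > sum(solo_gains.values())


# Placeholder values for illustration only; NOT measurements from the paper.
# Only "prompt_shield" is named in the source; the other keys are hypothetical.
solo = {
    "prompt_shield": 0.20,
    "component_b": 0.10,
    "component_c": 0.05,
    "component_d": 0.05,
}

# The component with the largest solo gain (the "largest single share").
largest = max(solo, key=solo.get)

# Combined gain above the solo sum (about 0.40) counts as super-additive.
print(largest, is_super_additive(solo, combined_gain=0.55))
```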

Abstract

Robustness testing for large language model-based recommender systems (LLM4Rec) typically relies on a handful of handwritten attack prompts drawn from one injection pattern at a single perturbation rate. These narrow test suites miss most of the attack surface and paint an overly optimistic picture of system security. We introduce an automated red teaming framework that replaces static templates with an adaptive loop. An attacker model generates diverse adversarial prompts across sixteen attack categories; a judge then scores each one using ranking distortion metrics. Successful attacks are fed back into the defense for iterative hardening. We run the framework against RoLLMRec on MovieLens, Amazon Books, and Yelp. The loop exposes attack success rates of 42.4%, 48.1%, and 52.0% on the three benchmarks, well above static, paraphrase, and PAIR-style baselines. After the hardening cycle, vulnerability drops by 56.4%, 74.3%, and 68.8% while false positive rates stay at or below 0.6%; module ablations indicate the four defense components combine super-additively, with the prompt shield contributing the largest single share. The full attack corpus is publicly available to support reproducible adversarial benchmarking in LLM4Rec.
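
The attack, judge, and harden cycle described in the abstract can be sketched with a toy example. Everything here is an assumption made for illustration: the `boost:` injection token, the five-item catalog, the top-k displacement score standing in for a ranking distortion metric, and the blocklist defense. None of it is the paper's attacker model, judge, or RoLLMRec.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical five-item catalog in its clean (unattacked) ranking order.
CATALOG = ["item_a", "item_b", "item_c", "item_d", "item_e"]


def toy_recommend(prompt: str, blocked: Set[str]) -> List[str]:
    """Toy recommender: an unblocked 'boost:<item>' token moves that item to rank 1."""
    ranking = list(CATALOG)
    for token in prompt.split():
        if token.startswith("boost:") and token not in blocked:
            item = token[len("boost:"):]
            if item in ranking:
                ranking.remove(item)
                ranking.insert(0, item)
    return ranking


def topk_displacement(clean: Sequence[str], attacked: Sequence[str], k: int) -> float:
    """Judge score: fraction of the clean top-k items pushed out of the attacked top-k."""
    return len(set(clean[:k]) - set(attacked[:k])) / k


def run_round(prompts: List[str], blocked: Set[str],
              k: int = 2, threshold: float = 0.5) -> Tuple[List[str], float]:
    """Score every prompt; return the successful ones and the attack success rate."""
    clean = toy_recommend("", blocked)
    successes = [
        p for p in prompts
        if topk_displacement(clean, toy_recommend(p, blocked), k) >= threshold
    ]
    return successes, len(successes) / len(prompts)


def harden(blocked: Set[str], successes: List[str]) -> None:
    """Feed successful attacks back into the defense by blocking their injection tokens."""
    for prompt in successes:
        blocked.update(t for t in prompt.split() if t.startswith("boost:"))


prompts = ["boost:item_e", "boost:item_a", "boost:item_d boost:item_e", "hello"]
blocked: Set[str] = set()

before, asr_before = run_round(prompts, blocked)  # attack the undefended system
harden(blocked, before)                           # one hardening cycle
after, asr_after = run_round(prompts, blocked)    # re-run the same attacks
print(asr_before, asr_after)
```

In this toy setting the success rate falls after one hardening pass because the blocklist covers every token that worked. A real framework would also need to track false positives on benign prompts, which the sketch does not model.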
