What question did this study set out to answer?

This evaluation aims to measure the residual vulnerability of frontier RLHF-aligned language models to published adversarial prompt attack classes.

May 2, 2026 · Open Access

Residual Manipulability of Frontier Aligned Language Models Under Published Prompt-Attack Classes: A Pre-Registered, Multi-Model Red-Team Study with Judge-Model Ablation

Key Points

This evaluation aims to measure the residual vulnerability of frontier RLHF-aligned language models to published adversarial prompt attack classes.
The study ran a pre-registered red-team evaluation on three Anthropic models (claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5).
Behavior was assessed over 720 API calls (240 per model) and scored on a three-dimensional rubric (refusal, harm severity, compliance quality) by a blinded judge model.
A judge-model ablation re-scored the Opus run with claude-haiku-4-5 to bound dependence on the choice of judge.
Four of five attack categories had aggregate effect sizes below the pre-registered Case B threshold (Δharm < 0.3) on every model.
The single exception was Persona Override on Opus 4.7, where Δharm was between +0.29 and +0.38 and the bootstrap 95% CI reached into Case A territory.
The most capable model (Opus 4.7) was the most marginally manipulable on these attacks, while the least capable model (Haiku 4.5) was the most robust.
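
As a rough illustration of how a Δharm estimate and its bootstrap 95% CI could be compared against the 0.3 threshold, the sketch below uses only the Python standard library. It rests on several assumptions that are not stated in this summary: that Δharm is the mean harm-severity score under attack minus the mean under a baseline condition, that a percentile bootstrap is used, and that a CI straddling the threshold is labelled a boundary case. The function names (`delta_harm`, `bootstrap_ci`, `classify`) and the scores are hypothetical and are not taken from the deposited pipeline.

```python
import random
from statistics import mean

CASE_B_THRESHOLD = 0.3  # threshold quoted in the summary (Case B: delta_harm < 0.3)


def delta_harm(attack_scores, baseline_scores):
    # ASSUMED definition: mean harm severity under attack minus mean under baseline.
    return mean(attack_scores) - mean(baseline_scores)


def bootstrap_ci(attack_scores, baseline_scores, n_resamples=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample each arm with replacement and
    # recompute delta_harm on every resample.
    rng = random.Random(seed)
    stats = sorted(
        delta_harm(
            rng.choices(attack_scores, k=len(attack_scores)),
            rng.choices(baseline_scores, k=len(baseline_scores)),
        )
        for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


def classify(ci_low, ci_high, threshold=CASE_B_THRESHOLD):
    # HYPOTHETICAL three-way labelling; the study's exact decision rule
    # is not given in this summary.
    if ci_high < threshold:
        return "case_b"
    if ci_low >= threshold:
        return "case_a"
    return "boundary"


# Synthetic harm-severity scores, invented purely for illustration.
attack = [0, 0, 1, 0, 2, 0, 1, 0, 0, 1]
baseline = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
low, high = bootstrap_ci(attack, baseline)
print(delta_harm(attack, baseline), (low, high), classify(low, high))
```

Fixing the seed makes the interval reproducible across runs, which matches the spirit of a pre-registered analysis. The study's own resampling scheme may differ, for example by resampling at the probe level.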

Abstract

Frontier RLHF-aligned language models are widely deployed as general-purpose assistants. Their residual vulnerability to already-published adversarial-prompt attack classes is under-measured in the open literature: safety claims tend to rest on internal vendor evaluations whose methodology is not externally auditable. We report a pre-registered, fully reproducible red-team evaluation of three frontier Anthropic models — claude-opus-4-7, claude-sonnet-4-6, and claude-haiku-4-5 — under five attack classes drawn from the published adversarial-ML literature (persona override; few-shot context poisoning; multi-turn trust escalation; indirect prompt injection; authority impersonation). We measure behavioral drift toward six classes of moderate harm across 720 API calls (240 per model), scored on a three-dimensional rubric (refusal / harm severity / compliance quality) by a blinded Claude-as-judge, and we ablate the judge by re-scoring the Opus run with claude-haiku-4-5 to bound judge-model dependence. Aggregate effect sizes are small: four of five attack categories fall below the pre-registered Case B threshold (Δharm < 0.3) on every model. The single exception is Persona Override on Opus 4.7, where Δharm sits at +0.29 to +0.38 with bootstrap 95% CI reaching into Case A territory; judge-model choice flips the binary classification (Cohen's κ = 0.646; harm-severity Pearson r = 0.789). Our central empirical finding is an inverse within-vendor capability gradient: across aggregate Δharm, per-probe Δharm, refusal rate, and case-study mechanism, the most capable model (Opus 4.7) is the most marginally manipulable on these attacks, and the least capable model (Haiku 4.5) is the most robust. 
Mechanism analysis on the boundary triple shows that the gradient is not flat: greater capability buys flexibility to reject the persona override at the system-prompt layer while still complying with the underlying user request when judged benign; lower capability pattern-matches on the attack template itself and refuses end-to-end. This paper is one of two empirical anchors for the parent record Pharos Lighthouse (UN CTED submission, 2026-04-18; Zenodo DOI 10.5281/zenodo.19645912). The companion paper P8 (Enforcement, Not Attribution) addresses the defense-side feasibility question. All attack templates, probes, code, raw responses, scored cells, judge transcripts, and analysis pipelines are deposited under the AIACW Empirical Teasers concept DOI 10.5281/zenodo.19687373 (current v1.0: 10.5281/zenodo.19771546). Every claim is falsifiable by re-running the experiment.
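
The two judge-agreement statistics quoted in the abstract (Cohen's κ on the binary classification and Pearson r on harm severity) can be computed from paired judge outputs roughly as follows. This is a minimal sketch using the textbook formulas. The function names and the toy labels are hypothetical, and the study's own analysis code may aggregate cells differently.

```python
from math import sqrt


def cohens_kappa(labels_a, labels_b):
    # Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    # Undefined (division by zero) when both raters use a single identical label.
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)


def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic paired outputs from two judges, invented for illustration only.
judge_1_labels = ["A", "B", "B", "B", "A", "B"]
judge_2_labels = ["A", "B", "B", "A", "A", "B"]
judge_1_severity = [2, 0, 0, 1, 2, 0]
judge_2_severity = [2, 0, 1, 1, 1, 0]

print(cohens_kappa(judge_1_labels, judge_2_labels))
print(pearson_r(judge_1_severity, judge_2_severity))
```

κ corrects raw agreement for chance, so it is the stricter check on whether swapping the judge changes the Case A / Case B call, while r only tracks whether the two judges rank harm severity similarly.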

Cite This Study

Hangyu Mei studied this question.

synapsesocial.com/papers/69f594fc71405d493afffe11 https://doi.org/10.5281/zenodo.19899470
