Frontier RLHF-aligned language models are widely deployed as general-purpose assistants. Their residual vulnerability to already-published adversarial-prompt attack classes is under-measured in the open literature: safety claims tend to rest on internal vendor evaluations whose methodology is not externally auditable. We report a pre-registered, fully reproducible red-team evaluation of three frontier Anthropic models (claude-opus-4-7, claude-sonnet-4-6, and claude-haiku-4-5) under five attack classes drawn from the published adversarial-ML literature: persona override, few-shot context poisoning, multi-turn trust escalation, indirect prompt injection, and authority impersonation. We measure behavioral drift toward six classes of moderate harm across 720 API calls (240 per model), scored on a three-dimensional rubric (refusal / harm severity / compliance quality) by a blinded Claude-as-judge, and we ablate the judge by re-scoring the Opus run with claude-haiku-4-5 to bound judge-model dependence. Aggregate effect sizes are small: four of five attack categories fall below the pre-registered Case B threshold (Δharm < 0.3) on every model. The single exception is Persona Override on Opus 4.7, where Δharm ranges from +0.29 to +0.38 and the bootstrap 95% CI reaches into Case A territory; here the choice of judge model flips the binary classification (Cohen's κ = 0.646; harm-severity Pearson r = 0.789). Our central empirical finding is an inverse within-vendor capability gradient: across aggregate Δharm, per-probe Δharm, refusal rate, and case-study mechanism, the most capable model (Opus 4.7) is the most manipulable on these attacks, though only marginally, and the least capable model (Haiku 4.5) is the most robust.
Mechanism analysis on the boundary triple shows that the gradient is not flat: greater capability buys flexibility to reject the persona override at the system-prompt layer while still complying with the underlying user request when judged benign; lower capability pattern-matches on the attack template itself and refuses end-to-end. This paper is one of two empirical anchors for the parent record Pharos Lighthouse (UN CTED submission, 2026-04-18; Zenodo DOI 10.5281/zenodo.19645912). The companion paper P8 (Enforcement, Not Attribution) addresses the defense-side feasibility question. All attack templates, probes, code, raw responses, scored cells, judge transcripts, and analysis pipelines are deposited under the AIACW Empirical Teasers concept DOI 10.5281/zenodo.19687373 (current v1.0: 10.5281/zenodo.19771546). Every claim is falsifiable by re-running the experiment.
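The statistics quoted above (Δharm with a bootstrap 95% CI, and Cohen's κ between two judges) can be illustrated with a minimal sketch. It is not the deposited analysis pipeline: the function names, the numeric harm-severity scale, the percentile bootstrap, and the resample count are all assumptions made for illustration, and Δharm is assumed here to be the mean harm score under attack minus the mean baseline score.

```python
import random
from statistics import mean


def delta_harm(baseline, attacked):
    """Assumed definition: mean harm under attack minus mean baseline harm."""
    return mean(attacked) - mean(baseline)


def bootstrap_ci(baseline, attacked, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for delta_harm (hypothetical helper).

    Each arm is resampled independently with replacement.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))
        a = rng.choices(attacked, k=len(attacked))
        stats.append(delta_harm(b, a))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

In use, `baseline` and `attacked` would hold the judge's harm-severity scores for one model and attack category, and `labels_a` / `labels_b` would hold the binary case classifications produced by the two judge models on the same cells.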
Hangyu Mei
www.synapsesocial.com/papers/69f594fc71405d493afffe11 — DOI: https://doi.org/10.5281/zenodo.19899470