April 24, 2026 | Open Access

WhyLab v3: Cross-Model Evaluation of a Causal Safety Audit for Self-Improving LLM Agents

Key Points

This research investigates whether a causal safety audit developed for one large language model (LLM) can be effectively transferred to others.
Studies cognitive policy oscillation and maps an instability phase diagram across 384 synthetic and 32 LLM conditions.
Implements a three-component causal audit (WhyLab) and evaluates its fixed C2 filter across six LLM families on an identical adversarial fact-tracking benchmark.
Measures regression reduction and paired accuracy changes, reporting p-values and Cohen's d.
Only the Gemini family showed regression reduction (Gemini 2.0: +20.9%, p=0.088), and that result is not Bonferroni-significant after the six-model family adjustment.
All other models (GPT-4o-mini, the Llama 3 variants, and Dolphin-Llama3) showed null or negative audit effects.
Paired accuracy decreased on all six models, and rejection rates varied 27x across models at an unchanged filter threshold, pointing to per-model threshold calibration as the binding deployment constraint.
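
As a rough illustration of the paired effect size behind the accuracy finding, the sketch below computes a paired-samples Cohen's d as the mean of per-seed differences divided by their sample standard deviation. This is an assumption: the summary does not state which variant of Cohen's d the paper uses, the function name `paired_cohens_d` is hypothetical, and the accuracy values are made up for the example and are not data from the paper.

```python
from statistics import mean, stdev


def paired_cohens_d(baseline, audited):
    """Paired-samples Cohen's d: mean per-seed difference over the
    sample standard deviation of those differences (assumed variant)."""
    if len(baseline) != len(audited):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(audited, baseline)]
    return mean(diffs) / stdev(diffs)


# Hypothetical per-seed accuracies, NOT results from the paper.
baseline = [0.80, 0.75, 0.90, 0.85]
audited = [0.70, 0.70, 0.80, 0.85]

d = paired_cohens_d(baseline, audited)
print(f"paired Cohen's d = {d:.2f}")
```

A negative value means accuracy was lower with the audit than without it on the same seeds, which is the direction reported for all six models.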

Abstract

Status: NeurIPS 2026 submission under double-blind review. Author identity anonymized.

Can a causal safety audit designed for one LLM transfer to others? We study cognitive policy oscillation, map an instability phase diagram (384 synthetic + 32 LLM conditions, with a sharp boundary at h ≈ 0.2), and implement a three-component causal audit (WhyLab: C1 drift + C2 E-value filter + C3 Lyapunov damping). We then evaluate the fixed C2 audit across six LLM families (Gemini 2.0/2.5 Flash, GPT-4o-mini, Llama 3:8b, Llama 3.1:8b, Dolphin-Llama3:8b) on an identical adversarial fact-tracking benchmark. Only the Gemini family shows regression reduction (Gemini 2.0: +20.9%, p=0.088; Gemini 2.5: +100%, underpowered). GPT-4o-mini, all Llama 3 variants, and Dolphin-Llama3 show a null or negative audit effect. Paired accuracy is reduced on all six models (Cohen's d from -0.46 to -1.09). The rejection rate explains the gap: for an unchanged filter threshold it spans 0.5 to 13.75 per trajectory across models (a 27x spread), which identifies per-model threshold calibration as the binding deployment constraint.

Headline correction: the previously reported 44% regression reduction on Gemini 2.0 Flash is corrected to 20.9% under paired reanalysis of all 20 seeds, and the result is no longer Bonferroni-significant after the six-model family adjustment.

Contributions: (1) an instability phase diagram for self-improving LLM agents, (2) a cross-model reproducibility evaluation of a causal safety audit, and (3) a rejection-rate mechanism analysis identifying per-model calibration as the binding constraint.

Change log (v3 vs v2): expanded the E7v2 benchmark to six LLM families; 60 paired seeds total; headline number corrected (44% -> 20.9%); abstract, introduction, and conclusion rewritten for cross-model framing; phase diagram preserved as the primary scientific artifact; Codex anonymization blockers removed.
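
The arithmetic behind two of the abstract's claims can be checked directly. The sketch below uses only numbers stated in the abstract, with one assumption: the family-wise significance level of 0.05 is a conventional choice and is not given in the abstract. The variable names are illustrative.

```python
ALPHA = 0.05         # assumed family-wise significance level (not stated in the abstract)
N_MODELS = 6         # six LLM families, from the abstract
P_GEMINI_20 = 0.088  # reported p-value for Gemini 2.0

# Bonferroni adjustment: divide the family-wise level by the number of tests.
bonferroni_alpha = ALPHA / N_MODELS
significant_after_adjustment = P_GEMINI_20 < bonferroni_alpha

# Rejection-rate range per trajectory at the unchanged filter threshold.
min_rate, max_rate = 0.5, 13.75
spread = max_rate / min_rate

print(f"Bonferroni threshold: {bonferroni_alpha:.4f}")
print(f"Gemini 2.0 significant after adjustment: {significant_after_adjustment}")
print(f"Rejection-rate spread: {spread:.1f}x")
```

Under this assumption the adjusted threshold is far below the reported p-value, consistent with the statement that the Gemini 2.0 result is not Bonferroni-significant, and the ratio of the two rejection rates matches the roughly 27x spread quoted above.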
