What question did this study set out to answer?

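This study investigates how two factors affect peer evaluations between language models: the auditor's knowledge of the evaluated LLM's identity, and a mismatch between the language of the auditing instruction and the language of the evidence being audited.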

June 26, 2026 | Open Access

How LLMs Audit Each Other: Five Mechanisms of Auditor Bias in Cross-Model Peer Review Under Identity Disclosure and Cross-Lingual Conditions

Key Points

Investigates how identity disclosure and language differences affect peer evaluations between language models.
Ran three experimental components involving nine commercial LLMs: ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Meta AI, Perplexity, and Qwen.
Employed independent chat sessions to prevent memory contamination in evaluations.
Used a bilingual evidence corpus generated independently in each language for authentic cross-lingual analysis.
Identified five mechanisms of auditor instability including register collapse and selective evidence loss.
Found that changing the language of the auditing prompt led to incompatible assessments in some models.
Found that independent parallel evaluation by two auditors yields more stable results than single-auditor evaluation.

Abstract

Large language models (LLMs) are increasingly being proposed as evaluators of other LLMs for benchmarking, red teaming, safety assessment, and automated peer review. However, an important methodological question remains insufficiently explored: how stable are these evaluations when the auditor knows the identity of the system being evaluated, and when the language of the auditing instruction differs from the language of the evidence being audited?

This study investigates these questions through three experimental components involving nine commercial models: ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Meta AI, Perplexity, and Qwen. A central design feature of this work is the use of clean, independent chat sessions with no cross-session memory contamination. In addition, the bilingual evidence corpus used as the audit target was generated independently in each language rather than translated, allowing genuine cross-lingual triangulation and avoiding translation artifacts.

The study reports three principal findings:

Identity disclosure bias. Single-auditor peer review reveals five recurrent mechanisms of auditor instability: Register Collapse, Selective Evidence Loss, Reconstructive Instability, Scope Narrowing, and Asymmetric Severity Redistribution.

Cross-lingual audit instability. In some models, changing only the language of the auditing prompt while keeping the underlying evidence constant produces diagnostically incompatible assessments.

Dual-auditor robustness. Independent parallel evaluation by two architectures produces more stable assessments than single-auditor designs.

Across all experimental components, the main methodological finding is that self-reported numeric scores are unreliable indicators of auditor stability. Models may assign identical scores to qualitatively different diagnoses or report high confidence while silently omitting substantive findings.

The results suggest that LLM-to-LLM evaluation should not rely exclusively on self-reported metrics or single-auditor designs. Instead, robust evaluation frameworks should incorporate qualitative analysis of justificatory text, independent replication, cross-lingual validation, and multi-auditor architectures. This work contributes to the growing literature on LLM-as-a-Judge, AI alignment evaluation, cross-lingual robustness, and AI safety assessment.

Keywords: large language models, LLM-as-a-Judge, auditor bias, peer review, cross-lingual evaluation, AI safety, alignment, model evaluation, automated auditing.
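
The abstract's point that identical numeric scores can hide qualitatively different diagnoses can be illustrated with a small sketch. It is not the study's method: the `Audit` record, the `jaccard` and `compare` helpers, the finding labels, and the example scores are all hypothetical, and set overlap is only one simple stand-in for the qualitative comparison of justificatory text that the abstract recommends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audit:
    """Hypothetical record of one auditor's output."""
    auditor: str
    score: int            # self-reported numeric score
    findings: frozenset   # labels of the issues the auditor actually cited


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two finding sets: 1.0 if identical, 0.0 if disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compare(x: Audit, y: Audit) -> dict:
    """Contrast agreement on the score with agreement on the findings."""
    return {
        "score_gap": abs(x.score - y.score),
        "finding_overlap": jaccard(x.findings, y.findings),
        "only_in_first": sorted(x.findings - y.findings),
        "only_in_second": sorted(y.findings - x.findings),
    }


# Made-up data: two auditors report the same score but cite different issues.
audit_a = Audit("auditor_a", 7, frozenset(
    {"unsupported_claim", "scope_drift", "missing_citation"}))
audit_b = Audit("auditor_b", 7, frozenset(
    {"unsupported_claim", "tone_shift"}))

report = compare(audit_a, audit_b)
print(report)
```

In this made-up case the score gap is 0 while only one of the four distinct findings is shared, which is the kind of disagreement a score-only comparison would miss. The same comparison could in principle be applied to one auditor's outputs under different prompt languages, or to two parallel auditors in a dual-auditor design.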
