Description

Large language models (LLMs) are increasingly being proposed as evaluators of other LLMs for benchmarking, red teaming, safety assessment, and automated peer review. However, an important methodological question remains insufficiently explored: how stable are these evaluations when the auditor knows the identity of the system being evaluated, and when the language of the auditing instruction differs from the language of the evidence being audited?

This study investigates these questions through three experimental components involving nine commercial models: ChatGPT, Claude, Copilot, DeepSeek, Gemini, Grok, Meta AI, Perplexity, and Qwen. A central design feature of this work is the use of clean, independent chat sessions with no cross-session memory contamination. In addition, the bilingual evidence corpus used as the audit target was generated independently in each language rather than translated, allowing genuine cross-lingual triangulation and avoiding translation artifacts.

The study reports three principal findings:

1. Identity disclosure bias. Single-auditor peer review reveals five recurrent mechanisms of auditor instability: Register Collapse, Selective Evidence Loss, Reconstructive Instability, Scope Narrowing, and Asymmetric Severity Redistribution.
2. Cross-lingual audit instability. In some models, changing only the language of the auditing prompt while keeping the underlying evidence constant produces diagnostically incompatible assessments.
3. Dual-auditor robustness. Independent parallel evaluation by two architectures produces more stable assessments than single-auditor designs.

Across all experimental components, the main methodological finding is that self-reported numeric scores are unreliable indicators of auditor stability. Models may assign identical scores to qualitatively different diagnoses or report high confidence while silently omitting substantive findings.
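To make the point about numeric scores concrete, the following is a minimal sketch, not the study's actual analysis pipeline. It assumes each audit can be reduced to a self-reported score plus a set of finding labels. The `Audit` class, the `jaccard` and `compare` helpers, the auditor names, and the finding labels are all hypothetical and introduced only for illustration. The sketch shows how two audits can agree exactly on the score while sharing only a fraction of their findings.

```python
from dataclasses import dataclass, field


@dataclass
class Audit:
    """One auditor run: a self-reported score plus the findings it lists.

    Hypothetical structure, assumed for illustration only.
    """
    auditor: str
    score: float
    findings: set[str] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two finding sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare(x: Audit, y: Audit) -> dict:
    """Contrast agreement on the numeric score with agreement on the findings."""
    return {
        "score_gap": abs(x.score - y.score),
        "finding_overlap": jaccard(x.findings, y.findings),
        "only_in_first": sorted(x.findings - y.findings),
        "only_in_second": sorted(y.findings - x.findings),
    }


# Invented example data: same auditor, with and without identity disclosure.
blind = Audit("auditor_a_blind", 7.0,
              {"unsupported_claim", "scope_gap", "citation_error"})
disclosed = Audit("auditor_a_disclosed", 7.0, {"unsupported_claim"})

report = compare(blind, disclosed)
print(report)
```

In this invented case the score gap is zero, yet two of the three findings disappear in the second run. A score-only comparison would miss that kind of divergence.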
The results suggest that LLM-to-LLM evaluation should not rely exclusively on self-reported metrics or single-auditor designs. Instead, robust evaluation frameworks should incorporate qualitative analysis of justificatory text, independent replication, cross-lingual validation, and multi-auditor architectures. This work contributes to the growing literature on LLM-as-a-Judge, AI alignment evaluation, cross-lingual robustness, and AI safety assessment.

Keywords: large language models, LLM-as-a-Judge, auditor bias, peer review, cross-lingual evaluation, AI safety, alignment, model evaluation, automated auditing.
Evans Tovar (Wed,) studied this question.