What question did this study set out to answer?

This research examines the ethical integrity and reliability of large language models in a heterogeneous multi-agent system.

March 15, 2026 | Open Access

Ethics and Reliability in Heterogeneous Multi-Agent LLM Systems: An Empirical Analysis of Claude, GPT-5, and DeepSeek

Key Points

The study made 510 API calls across 170 categorized prompts in 11 categories.
It assessed censorship behavior, ethics scores, response consistency, and latency.
It used Cohen's Kappa to evaluate divergence in censorship decisions.
DeepSeek shows precise censorship of the Tiananmen Massacre while allowing discourse on other China-critical topics.
With maximum aggregation, the multi-agent system outperforms the best single model, GPT-5, in ethics score (M=0.586 vs. M=0.574).
Cohen's Kappa between US models and DeepSeek is 0.0, revealing complete divergence in censorship.
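
For readers unfamiliar with the metric, the following is a minimal sketch of how Cohen's Kappa can be computed for two models' censorship decisions. It assumes each response is labeled 1 (refused) or 0 (answered); the page does not state the study's labeling scheme, and the label lists below are hypothetical, not the study's data. Kappa compares observed agreement with the agreement expected by chance: 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. A value of exactly 0 also results whenever one rater gives the same label to every item, as the first case shows.

```python
from collections import Counter
import math


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = refused, 0 = answered.
never_refuses = [0] * 10            # a model that answers every prompt
refuses_some = [0] * 8 + [1] * 2    # a model that refuses two prompts

kappa_constant = cohens_kappa(never_refuses, refuses_some)
kappa_same = cohens_kappa([0, 1, 0, 1], [0, 1, 0, 1])
kappa_opposite = cohens_kappa([0, 1, 0, 1], [1, 0, 1, 0])
print(kappa_constant, kappa_same, kappa_opposite)
```

In this sketch the two models agree on 8 of 10 prompts, yet kappa is 0 because that level of agreement is exactly what the marginal frequencies predict by chance.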

Abstract

This study empirically investigates the ethical integrity and reliability of a heterogeneous Multi-Agent System (MAS) composed of three large language models from different geopolitical contexts: Claude (Anthropic, USA), GPT-5 (OpenAI, USA), and DeepSeek (China). Using 510 API calls across 170 categorized prompts in 11 categories, we measured censorship behavior, ethics scores, response consistency, and latency. Our central finding is that DeepSeek exhibits highly precise, topic-specific censorship: exclusively the Tiananmen Massacre of 1989 triggers a trained refusal response, while all other China-critical topics (Tibet, Taiwan, Xinjiang, Hong Kong) are answered without restriction. Cohen's Kappa between US models and DeepSeek equals 0.0, indicating complete divergence in censorship decisions driven by geopolitical training constraints. The MAS (maximum aggregation) outperforms the best single model (GPT-5) in ethics score (M=0.586 vs. M=0.574, Kruskal-Wallis H=12.78, p=0.0017), confirming that redundancy-based MAS design effectively compensates for individual agent gaps. We introduce Cohen's Kappa as a standardizable metric for geopolitical divergence monitoring in heterogeneous MAS, and release a 170-prompt open-source benchmark for future replication studies.
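
As an illustration of the aggregation idea, here is a minimal sketch that assumes "maximum aggregation" means taking, for each prompt, the highest ethics score among the three agents. The abstract does not spell out the procedure, so this reading is an assumption; the model names and scores below are hypothetical placeholders, not the study's data.

```python
from statistics import mean


def max_aggregate(scores_by_model):
    """Return the per-prompt maximum score across all models.

    scores_by_model maps a model name to its list of per-prompt scores;
    every list must cover the same prompts in the same order.
    """
    columns = list(scores_by_model.values())
    if not columns or len({len(column) for column in columns}) != 1:
        raise ValueError("every model needs a score for every prompt")
    return [max(per_prompt) for per_prompt in zip(*columns)]


# Hypothetical ethics scores for four prompts (not the study's data).
scores = {
    "model_a": [0.6, 0.4, 0.7, 0.5],
    "model_b": [0.5, 0.6, 0.6, 0.5],
    "model_c": [0.0, 0.7, 0.5, 0.6],  # 0.0 stands in for a refused prompt
}

mas_scores = max_aggregate(scores)
best_single_mean = max(mean(values) for values in scores.values())
print(mas_scores, mean(mas_scores), best_single_mean)
```

Under this reading, the aggregate mean can never fall below the best single model's mean, because each per-prompt maximum is at least as large as that model's own score on the prompt. The size of the gain therefore depends on how often the agents' weak spots fall on different prompts.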
