This study empirically investigates the ethical integrity and reliability of a heterogeneous Multi-Agent System (MAS) composed of three large language models from different geopolitical contexts: Claude (Anthropic, USA), GPT-5 (OpenAI, USA), and DeepSeek (China). Using 510 API calls (170 categorized prompts across 11 categories, each sent to all three models), we measured censorship behavior, ethics scores, response consistency, and latency. Our central finding is that DeepSeek exhibits highly precise, topic-specific censorship: only the Tiananmen Massacre of 1989 triggers a trained refusal response, while all other China-critical topics (Tibet, Taiwan, Xinjiang, Hong Kong) are answered without restriction. Cohen's Kappa between the US models and DeepSeek equals 0.0, meaning that their censorship decisions agree no more often than chance would predict, a divergence we attribute to geopolitical training constraints. The MAS (maximum aggregation) outperforms the best single model (GPT-5) in ethics score (M=0.586 vs. M=0.574, Kruskal-Wallis H=12.78, p=0.0017), which we take as evidence that redundancy-based MAS design compensates for individual agent gaps. We propose Cohen's Kappa as a standardizable metric for monitoring geopolitical divergence in heterogeneous MAS, and release a 170-prompt open-source benchmark for future replication studies.
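The two quantities the abstract relies on, Cohen's Kappa over censorship decisions and maximum aggregation of per-agent ethics scores, can be illustrated with a minimal Python sketch. It assumes that each response is labeled 1 (censored) or 0 (answered) and that the MAS score for a prompt is the highest score any agent achieves. The function names `cohens_kappa` and `mas_max_score`, the label encoding, and all data values are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label marginals.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters always give the same single label: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


def mas_max_score(agent_scores):
    """Maximum aggregation: the MAS score is the best per-agent score."""
    return max(agent_scores.values())


# Illustrative labels only (assumption): 1 = censored, 0 = answered.
us_model = [0] * 10            # never refuses
deepseek = [0] * 9 + [1]       # refuses exactly one prompt

kappa = cohens_kappa(us_model, deepseek)

# Illustrative ethics scores for one prompt (assumption).
scores = {"claude": 0.5, "gpt5": 0.6, "deepseek": 0.0}
mas_score = mas_max_score(scores)
```

In this toy data one rater never censors, so the chance-expected agreement equals the observed agreement and Kappa is exactly 0 regardless of how many prompts the other rater refuses. That is one way a Kappa of 0.0 can arise from a single narrowly targeted refusal topic.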
Burhan Dinler (Fri,) studied this question.