We audit five commercial large language models (OpenAI gpt-4o-mini, Google gemini-2.5-flash-lite, xAI grok-4-fast, DeepSeek V3.2, and Moonshot Kimi k2) on 200 Traditional Chinese prompts designed to probe Taiwan political sensitivity. Each vendor responds to each prompt under a fixed generation configuration, yielding 1,000 observations. Responses are hand-labeled along a four-category taxonomy (hard refusal, soft refusal, on-task, API-blocked), and all statistics are reported with prompt-level paired-bootstrap 95% BCa confidence intervals. Four findings emerge. First, the intuitive East-West alignment dichotomy is refuted: the two Chinese-owned vendors produce the most divergent refusal distributions in the panel (JSD 0.200, 95% CI [0.149, 0.256]), while DeepSeek's aggregate distribution is statistically indistinguishable from those of the U.S. vendors. Second, Kimi's 7% API-level content filter rejects 4 of 50 neutral factual prompts about Republic of China state institutions, all of which were expected to receive on-task (OT) responses; this supports reading the filter as blocking Taiwan statehood rather than blocking opinions on sovereignty. Third, a topic-stratified view reveals a four-profile vendor taxonomy; notably, DeepSeek's sovereignty on-task rate collapses to 10.3% (95% CI [2.6, 23.3]) while its non-sovereignty behavior matches Western vendors, a disjoint-CI collapse unique in the panel. Fourth, an elasticity analysis of the shift from hard refusal to soft refusal (HR→SR) separates responsive-RLHF vendors from ceiling-bound and stiff-RLHF vendors. A 40-prompt flagship-tier sensitivity subset shows that these four findings retain their qualitative character when OpenAI, Gemini, and Grok are queried at capability-matched flagship endpoints, so the observed inter-vendor divergence is not a model-scale artifact. Code, prompts, per-response logs, hand-labels, and the auxiliary AI-judge audit trail are released. For LLM-agent simulation in politically sensitive domains, we recommend treating vendor as a first-class experimental variable and reporting layer-stratified refusal metrics.
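The pairwise comparison of refusal distributions can be illustrated with a minimal sketch. It computes the Jensen-Shannon divergence (JSD) between two vendors' four-category label distributions and resamples prompts jointly, so each bootstrap draw keeps both vendors' responses to the same prompt paired. Several details here are assumptions rather than the released pipeline: the category codes, the function names, and the base-2 logarithm are illustrative, and a plain percentile interval stands in for the BCa interval used in the study.

```python
import math
import random
from collections import Counter

# Hypothetical category codes: hard refusal, soft refusal, on-task, API-blocked.
CATEGORIES = ("HR", "SR", "OT", "BLOCKED")


def distribution(labels):
    """Return the relative frequency of each category in a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in CATEGORIES]


def jsd(p, q):
    """Jensen-Shannon divergence in bits (base-2 logs are an assumption)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def paired_bootstrap_jsd(labels_a, labels_b, n_boot=2000, seed=0):
    """Point estimate and 95% percentile interval for the JSD.

    labels_a[i] and labels_b[i] must be the two vendors' labels for the
    same prompt i; prompts are resampled with replacement, keeping pairs.
    Note: this is a percentile interval, not the BCa interval.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be aligned by prompt")
    rng = random.Random(seed)
    n = len(labels_a)
    point = jsd(distribution(labels_a), distribution(labels_b))
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_a = [labels_a[i] for i in idx]
        resampled_b = [labels_b[i] for i in idx]
        stats.append(jsd(distribution(resampled_a), distribution(resampled_b)))
    stats.sort()
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return point, lower, upper
```

Calling `paired_bootstrap_jsd(labels_a, labels_b)` on two prompt-aligned label lists returns the point estimate and the interval bounds. If SciPy is available, `scipy.stats.bootstrap` with `paired=True` and `method='BCa'` is one way to obtain a bias-corrected and accelerated interval in place of the percentile interval used here.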
Cheng-Hsun Tseng studied this question.