What question did this study set out to answer?

To investigate how commercial large language models refuse Taiwan-political prompts, and to test whether an East-West alignment dichotomy explains those refusal patterns.

April 24, 2026 | Open Access

Vendor-Specific Refusal Patterns in LLM Responses to Taiwan-Political Prompts: Evidence Against a Monolithic East–West Alignment Dichotomy

Key Points

Audited five commercial large language models with 200 Traditional Chinese prompts.
Classified vendor responses into four categories: hard refusal, soft refusal, on-task, API-blocked.
Reported all statistics with prompt-level paired-bootstrap 95% BCa confidence intervals.
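The two Chinese-owned vendors (DeepSeek and Kimi) produced the most divergent refusal distributions in the panel (JSD 0.200), refuting the East-West alignment claim.
Kimi's API-level content filter rejected 4 of 50 neutral factual prompts about Republic of China state institutions, which points to blocking on Taiwan statehood rather than on sovereignty opinions.
Findings kept their qualitative character on a 40-prompt subset run against flagship-tier endpoints, so the inter-vendor divergence is not a model-scale artifact.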
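The four-category taxonomy in the key points can be turned into a per-vendor refusal distribution by simple counting. The following is a minimal sketch, not the authors' released code: the label strings, the `refusal_distribution` helper, and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical label strings for the four categories named in the study.
CATEGORIES = ("hard_refusal", "soft_refusal", "on_task", "api_blocked")


def refusal_distribution(labels):
    """Return the share of each category among one vendor's response labels."""
    if not labels:
        raise ValueError("labels must be non-empty")
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    # Categories that never occur get a share of 0.0 (Counter returns 0).
    return {c: counts[c] / total for c in CATEGORIES}


# Toy labels for illustration only; these are NOT data from the study.
toy_labels = (
    ["on_task"] * 6 + ["soft_refusal"] * 2 + ["hard_refusal", "api_blocked"]
)
dist = refusal_distribution(toy_labels)
```

Keeping all four categories in the output, including those with zero share, makes distributions from different vendors directly comparable position by position.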

Abstract

We audit five commercial large language models (OpenAI gpt-4o-mini, Google gemini-2.5-flash-lite, xAI grok-4-fast, DeepSeek V3.2, and Moonshot Kimi k2) on 200 Traditional Chinese prompts designed to probe Taiwan political sensitivity. Each vendor responds to each prompt under a fixed generation configuration, yielding 1,000 observations. Hand-labeled responses are classified along a four-category taxonomy (hard refusal, soft refusal, on-task, API-blocked), with all statistics reported under prompt-level paired-bootstrap 95% BCa confidence intervals. Four findings emerge. The intuitive East-West alignment dichotomy is refuted: the two Chinese-owned vendors produce the most divergent refusal distributions in the panel (JSD 0.200, CI [0.149, 0.256]), while DeepSeek’s aggregate distribution is statistically indistinguishable from the U.S. vendors. Kimi’s 7% API-level content filter rejects 4 of 50 OT-expected neutral factual prompts about Republic of China state institutions, supporting a Taiwan-statehood blocking rather than sovereignty-opinion blocking reading. A topic-stratified view reveals a four-profile vendor taxonomy. DeepSeek’s sovereignty on-task rate collapses to 10.3% (CI [2.6, 23.3]) while its non-sovereignty behavior matches Western vendors, a disjoint-CI collapse unique in the panel. An HR→SR elasticity analysis separates responsive-RLHF vendors from ceiling-bound and stiff-RLHF vendors. A 40-prompt flagship-tier sensitivity subset shows these four findings retain their qualitative character when OpenAI, Gemini, and Grok are queried at capability-matched flagship endpoints, so the observed inter-vendor divergence is not a model-scale artifact. Code, prompts, per-response logs, hand-labels, and the auxiliary AI-judge audit trail are released. For LLM agent simulation in politically sensitive domains, we recommend treating vendor as a first-class experimental variable and reporting layer-stratified refusal metrics.
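To make the headline statistic concrete, the sketch below computes a Jensen-Shannon divergence (JSD) between two vendors' label distributions and a prompt-level paired-bootstrap interval around it. Several choices here are assumptions rather than details from the paper: base-2 logarithms, a plain percentile interval in place of the BCa interval the authors report, and the label strings, function names, and toy data, all of which are hypothetical.

```python
import math
import random

# Hypothetical label strings for the four categories named in the abstract.
CATEGORIES = ("hard_refusal", "soft_refusal", "on_task", "api_blocked")


def distribution(labels):
    """Share of each category, in CATEGORIES order."""
    n = len(labels)
    return [sum(1 for x in labels if x == c) / n for c in CATEGORIES]


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed), range 0 to 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def paired_bootstrap_ci(labels_a, labels_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for JSD (simplification: NOT the BCa interval).

    Pairing means the same resampled prompt indices are applied to both
    vendors, so each replicate compares them on an identical prompt set.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both vendors must be labeled on the same prompts")
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            jsd(
                distribution([labels_a[i] for i in idx]),
                distribution([labels_b[i] for i in idx]),
            )
        )
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy labels for illustration only; these are NOT data from the study.
vendor_a = ["on_task"] * 14 + ["soft_refusal"] * 4 + ["hard_refusal"] * 2
vendor_b = ["on_task"] * 8 + ["soft_refusal"] * 5 + ["api_blocked"] * 7
point = jsd(distribution(vendor_a), distribution(vendor_b))
ci_low, ci_high = paired_bootstrap_ci(vendor_a, vendor_b)
```

Resampling prompts rather than individual responses preserves the paired design, since every vendor answers the same prompt. A BCa interval would additionally correct the percentile endpoints for bias and skew.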
