Generative Artificial Intelligence (AI) is increasingly used for zero-shot text classification in social science, yet its outputs are inherently stochastic. Because reliability is a necessary condition for validity in content analysis methodology, this stochasticity poses a fundamental challenge; nevertheless, no systematic framework exists for quantifying and governing classification reliability prior to validity evaluation. This study proposes the Semantic Stability Protocol, which treats repeated large language model (LLM) outputs as structured groups of “AI coders” and applies traditional intercoder reliability metrics to assess classification consistency. Using DeepSeek Reasoner to classify 424 Chinese news articles into five categories within a single-model, single-language, single-domain configuration (100 runs per article), we find that raw outputs already exhibit high internal consistency (Krippendorff’s α = 0.8485) and that approximately 20 runs suffice for α > 0.94 after aggregation. Central to the protocol is a stability-stratified escalation framework in which two diagnostic indicators, the Majority Rate and the Confidence Gap, partition each classification into High-, Moderate-, or Low-stability strata, each triggering a different procedure: High-stability cases accept aggregated decisions directly, Moderate-stability cases undergo additional runs to reassess consistency, and Low-stability cases are flagged for human review. The study illustrates that generative model stochasticity can be governed within established reliability frameworks, giving researchers actionable guidance (minimum run counts, aggregation strategy selection, and stability diagnostics) for turning zero-shot classification into a transparent, auditable procedure.
Hsu et al. (Thu,) studied this question.
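The abstract's escalation logic can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions because the abstract does not define them. The Majority Rate is taken to be the share of runs that return the modal label, and the Confidence Gap the difference between the shares of the two most frequent labels. The cut-offs in `stratify` are hypothetical placeholders, as are all function names. The nominal Krippendorff's α follows the standard coincidence-matrix formulation, with each article's repeated runs treated as the “AI coders” for that unit.

```python
from collections import Counter
from itertools import permutations


def stability_diagnostics(labels):
    """Return (modal_label, majority_rate, confidence_gap) for one item's runs.

    Assumed definitions (not given in the abstract):
      majority_rate  = share of runs returning the most frequent label
      confidence_gap = share of top label minus share of runner-up label
    """
    counts = Counter(labels).most_common()
    total = len(labels)
    top_label, top_count = counts[0]
    second_count = counts[1][1] if len(counts) > 1 else 0
    return top_label, top_count / total, (top_count - second_count) / total


def stratify(majority_rate, confidence_gap, high=(0.9, 0.8), moderate=(0.6, 0.3)):
    """Map the two indicators to a stratum; thresholds are hypothetical."""
    if majority_rate >= high[0] and confidence_gap >= high[1]:
        return "High"      # accept the aggregated decision directly
    if majority_rate >= moderate[0] and confidence_gap >= moderate[1]:
        return "Moderate"  # schedule additional runs and reassess
    return "Low"           # flag for human review


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists: one inner list of labels per item,
    one label per run. Items with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue
        # Each ordered pair of distinct runs contributes 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one category in use: alpha is undefined
    return 1 - (n - 1) * observed / expected


if __name__ == "__main__":
    runs = ["politics"] * 95 + ["economy"] * 5  # made-up labels for one article
    label, rate, gap = stability_diagnostics(runs)
    print(label, rate, gap, stratify(rate, gap))
```

In practice the thresholds would be calibrated against the study's own data rather than fixed in advance, and the aggregated label used here (a simple majority vote) is only one of the aggregation strategies the abstract alludes to.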