What question did this study set out to answer?

The main aim is to propose a framework that quantifies reliability in zero-shot classification using generative AI.

May 9, 2026

Open Access

Semantic stability protocol: intercoder reliability for zero-shot classification via AI coders

Key Points

The study proposes a framework that quantifies reliability in zero-shot classification using generative AI.
The Semantic Stability Protocol was applied using DeepSeek Reasoner to classify 424 Chinese news articles into five categories.
Each article was classified 100 times in a single-model, single-language, single-domain configuration.
Traditional intercoder reliability metrics were used to assess classification consistency.
Raw outputs already showed high internal consistency (Krippendorff’s α = 0.8485).
Approximately 20 runs sufficed for α > 0.94 after aggregation.
Two diagnostic indicators, the Majority Rate and the Confidence Gap, assign each classification to a High-, Moderate-, or Low-stability stratum, which determines whether the aggregated decision is accepted, additional runs are made, or the case is flagged for human review.
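
To make the α figures above concrete, the following is a minimal sketch of Krippendorff’s α for nominal labels, treating each repeated run as one "AI coder". The function name, the input layout (one list of run labels per article), and the toy data are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    repeated runs ("AI coders") assigned to one article.
    Assumption: no special handling beyond skipping articles with
    fewer than two labels, which cannot form a pair.
    """
    observed = 0.0          # weighted count of disagreeing label pairs
    totals = Counter()      # how often each label occurs overall
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        counts = Counter(labels)
        totals.update(counts)
        # Ordered pairs of runs within this article that disagree.
        disagreeing = m * m - sum(c * c for c in counts.values())
        observed += disagreeing / (m - 1)

    n = sum(totals.values())
    # Ordered label pairs that would disagree under chance pairing.
    expected = n * n - sum(c * c for c in totals.values())
    if expected == 0:
        return float("nan")  # only one label ever used: alpha undefined
    return 1.0 - (n - 1) * observed / expected


# Toy data (made up): four articles, two runs each.
toy = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
alpha = krippendorff_alpha_nominal(toy)
```

In the study's configuration, each inner list would hold the 100 labels produced for one of the 424 articles.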

Abstract

Generative Artificial Intelligence (AI) is increasingly used for zero-shot text classification in social science, yet its outputs exhibit inherent stochasticity. Because reliability is a necessary condition for validity in content analysis methodology, this stochasticity poses a fundamental challenge, yet no systematic framework exists for quantifying and governing classification reliability prior to validity evaluation. This study proposes the Semantic Stability Protocol, which conceptualizes repeated large language model (LLM) outputs as structured groups of “AI coders” and applies traditional intercoder reliability metrics to assess classification consistency. Using DeepSeek Reasoner to classify 424 Chinese news articles into five categories within a single-model, single-language, single-domain configuration (100 runs per article), we find that raw outputs already exhibit high internal consistency (Krippendorff’s α = 0.8485) and that approximately 20 runs suffice for α > 0.94 after aggregation. Central to the protocol is a stability-stratified escalation framework: two diagnostic indicators, the Majority Rate and the Confidence Gap, partition each classification into High-, Moderate-, or Low-stability strata, triggering differentiated procedures: High-stability cases accept aggregated decisions directly, Moderate-stability cases undergo additional runs to reassess consistency, and Low-stability cases are flagged for human review. This study illustrates that generative model stochasticity can be governed within established reliability frameworks, providing researchers with actionable guidance (minimum run counts, aggregation strategy selection, and stability diagnostics) for transforming zero-shot classification into a transparent, auditable procedure.
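
The stability-stratified escalation described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the two indicators or give cut-offs, so the Majority Rate is taken here as the share of runs that agree with the most frequent label, the Confidence Gap as the difference between the shares of the top two labels, and all threshold values are hypothetical placeholders.

```python
from collections import Counter


def stability_indicators(labels):
    """Return (majority_label, majority_rate, confidence_gap) for one article.

    Assumed definitions (not taken from the paper):
      majority_rate  = share of runs agreeing with the most frequent label
      confidence_gap = share of top label minus share of runner-up label
    """
    ranked = Counter(labels).most_common()
    total = len(labels)
    top_label, top_n = ranked[0]
    second_n = ranked[1][1] if len(ranked) > 1 else 0
    return top_label, top_n / total, (top_n - second_n) / total


def stratum(majority_rate, confidence_gap,
            high=(0.90, 0.80), low=(0.60, 0.30)):
    """Map the two indicators to a stratum and its procedure.

    The `high` and `low` thresholds are hypothetical placeholders.
    """
    if majority_rate >= high[0] and confidence_gap >= high[1]:
        return "High", "accept aggregated decision"
    if majority_rate < low[0] or confidence_gap < low[1]:
        return "Low", "flag for human review"
    return "Moderate", "perform additional runs"


# Made-up run outputs for three articles (100 runs each).
stable = ["politics"] * 95 + ["economy"] * 5
mixed = ["politics"] * 70 + ["economy"] * 30
unstable = ["politics"] * 45 + ["economy"] * 40 + ["sports"] * 15

results = [stratum(*stability_indicators(x)[1:])
           for x in (stable, mixed, unstable)]
```

Keeping the thresholds as parameters makes the escalation rule explicit and auditable, which is the point of the protocol; the actual values would need to come from the full paper.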

Cite This Study

Hsu et al. studied this question.

synapsesocial.com/papers/69fecfcdb9154b0b82876c3a https://doi.org/10.1007/s11135-026-02832-9
