What question did this study set out to answer?

This research aims to explore the balance between robustness and reliability in Chinese legal large language models.

March 28, 2026 · Open Access

The trade-off between robustness and reliability in Chinese legal large language models: an empirical study

Key Points

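Curated a dataset of 5,000 Chinese judicial question-answer pairs
Generated semantic-preserving adversarial rewrites and kept only those that passed an embedding-based semantic consistency filter
Fine-tuned seven model variants (G0–G6) that differ only in their injection ratio of verified rewrites, with the total training budget and fine-tuning protocol held constant
Evaluated model reliability using objective accuracy on exam-style questions, expert evaluations of open-ended responses, and embedding-based semantic similarity
Assessed verdict accuracy and rationale quality in trademark infringement reasoning tasks
Moderate robustness data injection improves model reliability
Excessive injection degrades overall performance
Inverted-U relationship observed across 4B, 20B, and 32B backbones and both tasks; the aim is insensitivity to superficial variation without blunting sensitivity to legally decisive elements
Characteristic failure modes under excessive injection included attenuated legally salient distinctions, boilerplate rationales, and overly cautious abstention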
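
The fixed-budget injection design in the points above can be illustrated with a minimal sketch. It assumes each training set has the same size and that a group is defined only by the fraction of verified rewrites it contains. The function name `build_training_mix` and the ratios in `HYPOTHETICAL_RATIOS` are illustrative placeholders; the actual ratios used for G0–G6 are not stated here.

```python
import random

# Placeholder ratios for illustration only; the study's real G0-G6 values
# are not given in this summary.
HYPOTHETICAL_RATIOS = {"G0": 0.0, "G1": 0.1, "G2": 0.2, "G3": 0.3,
                       "G4": 0.4, "G5": 0.5, "G6": 0.6}


def build_training_mix(originals, rewrites, ratio, budget, seed=0):
    """Return exactly `budget` examples, round(ratio * budget) of them rewrites.

    Holding `budget` fixed means the groups differ only in composition,
    not in the amount of training data.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    n_rewrites = round(ratio * budget)
    n_originals = budget - n_rewrites
    if n_rewrites > len(rewrites) or n_originals > len(originals):
        raise ValueError("not enough examples to fill the budget")
    rng = random.Random(seed)  # seeded so each group is reproducible
    mix = rng.sample(rewrites, n_rewrites) + rng.sample(originals, n_originals)
    rng.shuffle(mix)
    return mix
```

Calling `build_training_mix(originals, rewrites, HYPOTHETICAL_RATIOS["G3"], budget=5000)` would then yield one group's training set. Sampling without replacement is a design choice of this sketch, not something the source specifies.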

Abstract

Legal large language models (LLMs) deployed in high-stakes judicial settings must exhibit robustness against non-substantive linguistic variations while preserving acute sensitivity to legally determinative facts and norms. This study investigates this robustness–reliability trade-off within the context of Chinese legal tasks. We curate a dataset of 5,000 Chinese judicial question–answer pairs and generate semantic-preserving adversarial rewrites, retaining only those validated by an embedding-based semantic consistency filter. Holding the total training budget and fine-tuning protocol constant, we fine-tune model variants that differ exclusively in their injection ratio of these verified rewrites, establishing seven distinct injection groups (G0–G6). We evaluate model reliability utilizing a composite protocol that incorporates objective accuracy on exam-style questions, expert evaluations of open-ended responses, and embedding-based semantic similarity. For trademark infringement reasoning tasks, we additionally assess verdict accuracy and rationale quality. Across varying model capacities (4B, 20B, and 32B backbones, including Qwen3-4B, GPT-OSS-20B, and Qwen3-VL-32B-Instruct) and both evaluated tasks, our findings reveal an inverted‑U relationship: moderate robustness data injection enhances reliability, whereas excessive injection degrades overall performance and induces characteristic failure modes, such as the attenuation of legally salient distinctions, the generation of boilerplate rationales, and overly cautious abstention. These findings substantiate “moderate robustness injection” as a practical heuristic and underscore a broader principle of differential sensitivity—achieving insensitivity to superficial variations without blunting the model’s sensitivity to legally decisive elements.
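
As a rough illustration of the embedding-based semantic consistency filter mentioned in the abstract, the sketch below keeps a rewrite only if its embedding is close enough to the original's under cosine similarity. The embedding model, the similarity measure, and the `threshold` value of 0.9 are all assumptions made for illustration; the abstract does not specify them. Embeddings are passed in as plain lists of floats so the example stays self-contained.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_rewrites(candidates, threshold=0.9):
    """Keep rewrites whose embedding stays close to the original's.

    `candidates` is an iterable of (rewrite_text, original_vec, rewrite_vec).
    `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    return [text for text, original_vec, rewrite_vec in candidates
            if cosine_similarity(original_vec, rewrite_vec) >= threshold]
```

A filter of this kind only checks overall semantic closeness. It would not by itself guarantee that a legally determinative fact survived the rewrite, which is one reason the study also relies on expert evaluation.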
