Legal large language models (LLMs) deployed in high-stakes judicial settings must be robust to non-substantive linguistic variation while remaining sensitive to legally determinative facts and norms. This study investigates this robustness–reliability trade-off in Chinese legal tasks. We curate a dataset of 5,000 Chinese judicial question–answer pairs and generate semantics-preserving adversarial rewrites, retaining only those that pass an embedding-based semantic consistency filter. Holding the total training budget and fine-tuning protocol constant, we fine-tune model variants that differ only in the injection ratio of these verified rewrites, yielding seven injection groups (G0–G6). We evaluate reliability with a composite protocol that combines objective accuracy on exam-style questions, expert evaluation of open-ended responses, and embedding-based semantic similarity. For trademark infringement reasoning tasks, we additionally assess verdict accuracy and rationale quality. Across model capacities (4B, 20B, and 32B backbones, including Qwen3-4B, GPT-OSS-20B, and Qwen3-VL-32B-Instruct) and both evaluated tasks, we find an inverted-U relationship: moderate injection of robustness data improves reliability, whereas excessive injection degrades overall performance and induces characteristic failure modes, such as attenuation of legally salient distinctions, boilerplate rationales, and overly cautious abstention. These findings support “moderate robustness injection” as a practical heuristic and point to a broader principle of differential sensitivity: insensitivity to superficial variation without blunted sensitivity to legally decisive elements.
Liu et al. studied this question.
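The data construction summarized above (an embedding-based consistency filter followed by fixed-budget mixing at a chosen injection ratio) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the 0.85 similarity threshold, the use of cosine similarity, and the choice to let rewrites displace original samples so that the total stays constant are all assumptions made for illustration; the abstract does not specify the ratios assigned to G0–G6, so the ratio is left as a parameter.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_rewrites(candidates, threshold=0.85):
    """Keep rewrites whose embedding stays close to the original's.

    `candidates` holds (rewrite, original_embedding, rewrite_embedding)
    tuples. The threshold value is hypothetical, not taken from the paper.
    """
    return [
        rewrite
        for rewrite, emb_orig, emb_rewrite in candidates
        if cosine_similarity(emb_orig, emb_rewrite) >= threshold
    ]


def build_training_mix(originals, verified_rewrites, injection_ratio, budget, seed=0):
    """Sample a training set of exactly `budget` items.

    A fraction `injection_ratio` of the budget is drawn from the verified
    rewrites and the remainder from the original samples, so the total
    size is the same for every injection group (assumed mixing scheme).
    """
    if not 0.0 <= injection_ratio <= 1.0:
        raise ValueError("injection_ratio must lie in [0, 1]")
    n_rewrites = round(budget * injection_ratio)
    n_originals = budget - n_rewrites
    if n_rewrites > len(verified_rewrites) or n_originals > len(originals):
        raise ValueError("not enough samples to fill the requested budget")
    rng = random.Random(seed)  # fixed seed keeps groups reproducible
    mix = rng.sample(originals, n_originals) + rng.sample(verified_rewrites, n_rewrites)
    rng.shuffle(mix)
    return mix
```

Under these assumptions, calling `build_training_mix` once per group with a different `injection_ratio` and the same `budget` yields training sets that differ only in their share of verified rewrites, which is the controlled comparison the abstract describes.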