What question did this study set out to answer?

January 26, 2026

Entity recognition of ancient Chinese books based on semantic association and internal structural features among words

Key Points

Aimed to improve named entity recognition (NER) for ancient Chinese texts through a robust framework.
Developed the GARNET framework integrating a domain-specific pre-trained model and W2NER.
Utilized sliding-window data augmentation and ensemble learning strategies.
Conducted experiments on three historical datasets using fivefold cross-validation.
Achieved F1 scores of 85.04% for Records of the Grand Historian, 90.28% for Twenty-Four Histories, and 84.49% for traditional Chinese medicine classics.
Improved boundary detection with a 5.73% average F1 gain from W2NER.
Demonstrated a 3.27% F1 improvement through sliding-window data augmentation.
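The sliding-window augmentation mentioned in the key points can be illustrated with a minimal sketch. The summary does not say how the windows are built, so this assumes one common reading: consecutive labelled sentences are concatenated into longer training samples. The function name `sliding_window_augment`, the BIO label format, and the `window` and `stride` defaults are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# One labelled sentence: (characters, BIO labels of the same length).
Sample = Tuple[List[str], List[str]]


def sliding_window_augment(
    sentences: List[Sample], window: int = 2, stride: int = 1
) -> List[Sample]:
    """Build extra samples by concatenating `window` consecutive sentences.

    Assumption: windows slide over whole sentences, so no entity is cut
    at a window boundary. The paper's actual procedure may differ.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")

    augmented: List[Sample] = []
    last_start = max(len(sentences) - window + 1, 0)
    for start in range(0, last_start, stride):
        chars: List[str] = []
        labels: List[str] = []
        for sent_chars, sent_labels in sentences[start:start + window]:
            chars.extend(sent_chars)
            labels.extend(sent_labels)
        augmented.append((chars, labels))
    return augmented


# Toy usage with placeholder characters instead of real corpus text.
toy = [
    (["a", "b"], ["B-PER", "I-PER"]),
    (["c"], ["O"]),
    (["d", "e"], ["B-LOC", "I-LOC"]),
]
extra_samples = sliding_window_augment(toy, window=2, stride=1)
```

Under this reading, each entity appears in several overlapping contexts, which is one plausible way such augmentation could help with low-frequency entities.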

Abstract

Purpose: This paper addresses challenges in named entity recognition (NER) for ancient Chinese texts, such as semantic complexity, ambiguous entity boundaries and syntactic divergence from modern simplified Chinese, by proposing GARNET. This robust framework enhances NER accuracy, thereby advancing structured knowledge extraction and digital humanities research.

Design/methodology/approach: This study integrates a domain-specific pre-trained model (GujiRoBERTa), word-pair relation modelling (W2NER), sliding-window data augmentation and ensemble learning into a new framework, GARNET. The W2NER layer uses dilated convolutions and biaffine transformations to model intra-entity structural and semantic relationships. Experiments are conducted on three datasets: Records of the Grand Historian, Twenty-Four Histories and traditional Chinese medicine classics, with fivefold cross-validation, sliding-window data augmentation and ensemble strategies for performance evaluation.

Findings: GARNET achieves state-of-the-art F1 scores of 85.04% (Records of the Grand Historian), 90.28% (Twenty-Four Histories) and 84.49% (traditional Chinese medicine classics), yielding an overall improvement of 6.18% over the baseline model. Model comparison experiments confirm the contributions of core components: W2NER improves boundary detection with an average F1 gain of 5.73%, while the ensemble strategy reduces prediction bias and stabilizes performance. Furthermore, ablation studies show that the proposed sliding-window data augmentation mechanism helps identify low-resource and low-frequency entities and improves overall recognition performance, with an F1 gain of 3.27%.

Originality/value: This study pioneers the application of W2NER to NER in ancient Chinese texts, addressing boundary ambiguities through structural analysis. The sliding-window data augmentation mechanism is particularly effective for identifying low-resource and low-frequency entities. The ensemble strategy not only proves effective within the source domain but also transfers its advantages to unseen data. The proposed framework provides a novel solution for extracting structured knowledge from ancient Chinese texts, with implications for historical research and cultural heritage digitization.
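The abstract does not specify how the ensemble combines predictions. As a hedged illustration only, the sketch below assumes majority voting over the entity spans predicted by several models, for instance one model per cross-validation fold. The name `vote_entities`, the `(start, end, type)` span format, and the `min_votes` parameter are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

# A predicted entity: (start index, exclusive end index, entity type).
Entity = Tuple[int, int, str]


def vote_entities(
    predictions: List[Set[Entity]], min_votes: Optional[int] = None
) -> Set[Entity]:
    """Keep entities predicted by at least `min_votes` models.

    Assumption: a strict majority is required by default. This is a
    generic voting scheme, not necessarily the strategy used in GARNET.
    """
    if not predictions:
        return set()
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1  # strict majority

    # Count how many models proposed each exact (span, type) triple.
    counts = Counter(entity for pred in predictions for entity in pred)
    return {entity for entity, votes in counts.items() if votes >= min_votes}


# Toy usage: five hypothetical fold models labelling the same sentence.
person = (0, 2, "PER")
place = (5, 7, "LOC")
fold_predictions = [{person, place}, {person}, {person, place}, set(), set()]
agreed = vote_entities(fold_predictions)
```

Requiring agreement on the exact span and type discards predictions that only one or two models make, which is one way an ensemble could reduce the prediction bias of any single model.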
