What question did this study set out to answer?

This work aims to enhance named entity recognition in historical Japanese documents by leveraging synthetic data generated by large language models.

June 20, 2026

Leveraging LLM-generated Synthetic Data for Low-Resource Named Entity Recognition in Historical Japanese Documents

Key Points

Investigated NER on Yakusha Hyōbanki, a collection of kabuki actor critique books from the Edo and early Meiji periods in Japan (1603–1912).
Developed a multi-stage training framework including masked language modeling and NER training on synthetic data.
Conducted experiments across seven model architectures with varying synthetic data scales.
Synthetic-data augmentation consistently improved NER performance over a one-stage baseline.
Limited gains were observed from domain-adaptive pretraining when high-quality synthetic data was used, highlighting a trade-off between cost and accuracy.
A two-stage synthetic-to-real pipeline was found to be an effective strategy for low-resource historical Japanese NER.

Abstract

Named entity recognition (NER) for historical Japanese documents remains challenging because pretrained encoders developed on modern corpora do not align well with historical orthography, including variant and old-form characters, archaic vocabulary, and domain-specific discourse conventions. At the same time, large language models (LLMs) provide strong generative and generalization capabilities that can help address annotation scarcity, while BERT-style encoders remain efficient and robust for sequence labeling. In this work, we investigate NER on Yakusha Hyōbanki, a collection of kabuki actor critique books from the Edo and early Meiji periods in Japan (1603–1912), and propose a multi-stage training framework that combines the advantages of both paradigms. The framework consists of: (i) optional domain-adaptive masked language modeling (MLM) on unlabeled in-domain text to reduce the gap between modern and historical Japanese; (ii) intermediate NER training on LLM-generated synthetic data tailored to the target schema, which improves learning for low-resource entity types; and (iii) final fine-tuning on expert-annotated data to align model predictions with domain-specific annotation guidelines. Experiments across seven model architectures and multiple synthetic-data scales show that synthetic-data augmentation consistently improves performance over a one-stage baseline, whereas adding the MLM stage yields only limited gains when high-quality synthetic data is available, revealing a practical trade-off between computational cost and accuracy. Overall, our results suggest a simple and effective recipe for low-resource historical Japanese NER: a two-stage synthetic → real pipeline serves as a strong default, while domain-adaptive pretraining can be introduced selectively when needed. We release our code at: https://github.com/BohaoWu/YakushahyobankiNER.
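The stage ordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: `Checkpoint`, `make_stage`, `build_pipeline`, `run_pipeline`, and the stage names are all hypothetical, and each stage is a placeholder that only records its name where a real implementation would run a training loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Checkpoint:
    """Stand-in for model weights; records which stages were applied (hypothetical)."""
    base_model: str
    history: List[str] = field(default_factory=list)


Stage = Callable[[Checkpoint], Checkpoint]


def make_stage(name: str) -> Stage:
    """Return a placeholder stage that only logs its name.

    A real stage would train the model and return updated weights.
    """
    def stage(ckpt: Checkpoint) -> Checkpoint:
        return Checkpoint(ckpt.base_model, ckpt.history + [name])
    return stage


def build_pipeline(use_mlm: bool) -> List[Stage]:
    """Assemble the stages in the order given in the abstract."""
    stages: List[Stage] = []
    if use_mlm:
        # (i) optional domain-adaptive MLM on unlabeled in-domain text
        stages.append(make_stage("mlm_on_unlabeled_in_domain_text"))
    # (ii) intermediate NER training on LLM-generated synthetic data
    stages.append(make_stage("ner_on_llm_synthetic_data"))
    # (iii) final fine-tuning on expert-annotated data
    stages.append(make_stage("ner_on_expert_annotated_data"))
    return stages


def run_pipeline(base_model: str, use_mlm: bool = False) -> Checkpoint:
    """Apply each stage in sequence, starting from a pretrained encoder."""
    ckpt = Checkpoint(base_model)
    for stage in build_pipeline(use_mlm):
        ckpt = stage(ckpt)
    return ckpt
```

Defaulting `use_mlm` to `False` mirrors the abstract's recommendation that the two-stage synthetic → real pipeline serve as the default, with the MLM stage enabled selectively.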
