Named entity recognition (NER) for historical Japanese documents remains challenging because pretrained encoders developed on modern corpora do not align well with historical orthography, including variant and old-form characters, archaic vocabulary, and domain-specific discourse conventions. At the same time, large language models (LLMs) provide strong generative and generalization capabilities that can help address annotation scarcity, while BERT-style encoders remain efficient and robust for sequence labeling. In this work, we investigate NER on Yakusha Hyōbanki, a collection of kabuki actor critique books from the Edo and early Meiji periods in Japan (1603–1912), and propose a multi-stage training framework that combines the advantages of both paradigms. The framework consists of: (i) optional domain-adaptive masked language modeling (MLM) on unlabeled in-domain text to reduce the gap between modern and historical Japanese; (ii) intermediate NER training on LLM-generated synthetic data tailored to the target schema, which improves learning for low-resource entity types; and (iii) final fine-tuning on expert-annotated data to align model predictions with domain-specific annotation guidelines. Experiments across seven model architectures and multiple synthetic-data scales show that synthetic-data augmentation consistently improves performance over a one-stage baseline, whereas adding the MLM stage yields only limited gains when high-quality synthetic data is available, revealing a practical trade-off between computational cost and accuracy. Overall, our results suggest a simple and effective recipe for low-resource historical Japanese NER: a two-stage synthetic → real pipeline serves as a strong default, while domain-adaptive pretraining can be introduced selectively when needed. We release our code at: https://github.com/BohaoWu/YakushahyobankiNER.
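The stage ordering described above can be sketched as a small configuration builder. This is a minimal illustration only, not the released code: the `Stage` class, the `build_stage_plan` function, and all stage and corpus labels are hypothetical names chosen for this example, and no actual training is performed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str       # hypothetical label for the stage
    objective: str  # "mlm" or "ner"
    data: str       # hypothetical label for the corpus used in the stage


def build_stage_plan(use_mlm: bool = False) -> List[Stage]:
    """Return the ordered training stages outlined in the abstract.

    The MLM stage is optional; the default is the two-stage
    synthetic -> real pipeline.
    """
    plan: List[Stage] = []
    if use_mlm:
        # (i) optional domain-adaptive MLM on unlabeled in-domain text
        plan.append(Stage("domain_adaptive_mlm", "mlm", "unlabeled_in_domain_text"))
    # (ii) intermediate NER training on LLM-generated synthetic data
    plan.append(Stage("intermediate_ner", "ner", "llm_synthetic_data"))
    # (iii) final fine-tuning on expert-annotated data
    plan.append(Stage("final_fine_tuning", "ner", "expert_annotated_data"))
    return plan


default_plan = build_stage_plan()            # two-stage default
full_plan = build_stage_plan(use_mlm=True)   # three stages, MLM first

for stage in full_plan:
    print(f"{stage.name}: {stage.objective} on {stage.data}")
```

Keeping the MLM stage behind a flag mirrors the reported trade-off: the two-stage plan is the cheaper default, and the extra pretraining stage can be switched on selectively.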
Wu et al. (Fri,) studied this question.