What question did this study set out to answer?

The main aim is to address high professional barriers and costs in domain corpus annotation through automation.

June 4, 2026 | Open Access

High Quality Automatic Annotation of Domain Corpora Based on Instruction Tuning of Large Language Models

Key Points

Set out to lower the high professional barriers and costs of domain corpus annotation through automation.
Constructed an automatic annotation framework utilizing instruction tuning of large language models.
Developed adaptive instruction templates and a hybrid annotation algorithm combining self-distillation and dynamic verification.
Evaluated the annotation model in financial risk control and medical diagnosis domains.
Achieved an annotation accuracy rate close to 90% in both domains.
Demonstrated significant improvements in domain adaptability and annotation quality.
Provided an efficient solution for rapid construction of domain-specific large models.
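
To make the idea of an instruction template concrete, the following is a minimal sketch in Python. The study does not publish its template wording, so the template text, the function name build_instruction, and the label names are all hypothetical; "adaptive" is interpreted here only as filling in per-domain fields.

```python
from string import Template

# Hypothetical template text; the study does not publish its actual prompts.
ANNOTATION_TEMPLATE = Template(
    "You are an annotator for the $domain domain.\n"
    "Label the text with exactly one of: $labels.\n"
    "Text: $text\n"
    "Label:"
)


def build_instruction(domain, labels, text):
    """Fill the template for one record using domain-specific fields."""
    return ANNOTATION_TEMPLATE.substitute(
        domain=domain,
        labels=", ".join(labels),
        text=text.strip(),
    )


# Illustrative record; the label names are assumptions, not taken from the study.
prompt = build_instruction(
    "financial risk control",
    ["high_risk", "low_risk"],
    "  Borrower missed three consecutive payments.  ",
)
print(prompt)
```

Switching to the medical diagnosis domain would only require a different domain string and label set, which is one plausible reading of how a single template can adapt across domains.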

Abstract

Building domain corpora is hampered by high expertise requirements, heavy annotation costs, and the difficulty of keeping annotations consistent. To address these problems, this study constructs an automatic annotation framework based on instruction tuning of large language models (LLMs). First, adaptive instruction templates are built and used to instruction-tune a general LLM, yielding a strong initial annotator. Second, a hybrid annotation algorithm combining iterative self-distillation with dynamic verification is designed; it continuously improves the annotation model's domain adaptability and annotation quality while annotated data is automatically generated and screened. Experiments in two typical domains, financial risk control and medical diagnosis, show that the method generates high-quality annotated data with an accuracy close to 90%, providing an efficient corpus solution for the rapid construction of domain-specific large models.
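
The abstract does not give the details of the hybrid algorithm, so the following Python sketch only illustrates the general shape of an iterative self-distillation loop with a verification step. Several things here are assumptions rather than the authors' method: a toy word-count model stands in for the instruction-tuned LLM, "dynamic verification" is modeled as a confidence threshold that tightens after a round in which labels are accepted and relaxes otherwise, and all function names, parameters, and sample texts are hypothetical.

```python
from collections import Counter, defaultdict


def train(examples):
    """Toy stand-in for fine-tuning: count word/label co-occurrences."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts


def annotate(model, text, labels):
    """Return (label, confidence) from add-one-smoothed word votes."""
    votes = Counter({label: 1.0 for label in labels})
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())


def self_distill(seed, pool, labels, rounds=3, threshold=0.6, step=0.05):
    """Grow the training set with the model's own labels that pass verification."""
    train_set, remaining = list(seed), list(pool)
    for _ in range(rounds):
        model = train(train_set)
        accepted, rejected = [], []
        for text in remaining:
            label, conf = annotate(model, text, labels)
            (accepted if conf >= threshold else rejected).append((text, label))
        train_set.extend(accepted)
        remaining = [text for text, _ in rejected]
        # Assumed rule: tighten after a productive round, relax otherwise.
        if accepted:
            threshold = min(0.95, threshold + step)
        else:
            threshold = max(0.5, threshold - step)
        if not remaining:
            break
    return train(train_set), train_set, remaining


labels = ["high_risk", "low_risk"]
seed = [("overdue loan default", "high_risk"),
        ("stable income repaid", "low_risk")]
pool = ["loan overdue again", "income stable this year",
        "default after overdue notice", "again notice"]

final_model, train_set, remaining = self_distill(seed, pool, labels)
print(len(train_set), remaining)
```

In this toy run, the text "again notice" is rejected in the first round because the seed model has never seen either word. It is accepted in the second round, after the model has been retrained on its own verified labels. That is the behavior the screening-and-retraining loop is meant to illustrate.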
