Constructing domain corpora faces three problems: high professional barriers, heavy annotation costs, and difficulty in ensuring annotation consistency. To address them, this paper constructs an automatic annotation framework based on instruction tuning of large language models (LLMs). First, adaptive instruction templates are built and used to instruction-tune a general-purpose LLM, yielding a strong initial annotator. Second, a hybrid annotation algorithm combining iterative self-distillation with dynamic verification is designed; as annotated data are automatically generated and screened, it continuously improves the annotation model's domain adaptability and annotation quality. Experiments in two representative domains, financial risk control and medical diagnosis, show that the method generates high-quality annotated data with an accuracy close to 90%, providing an efficient corpus-construction solution for rapidly building domain-specific large models.
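The generate-screen-retrain loop described above can be illustrated with a minimal sketch. It is not the paper's implementation. A toy keyword matcher (`KeywordAnnotator`, a hypothetical name) stands in for the instruction-tuned LLM, and "dynamic verification" is assumed here to be a confidence threshold that relaxes each round. All function names, parameters, and default values are illustrative assumptions.

```python
class KeywordAnnotator:
    """Toy stand-in for an instruction-tuned LLM annotator (hypothetical)."""

    def __init__(self, keywords):
        # keywords: dict mapping each label to a set of indicative tokens
        self.keywords = {label: set(kws) for label, kws in keywords.items()}

    def annotate(self, text):
        """Return (label, confidence), or (None, 0.0) if no keyword matches."""
        tokens = text.lower().split()
        scores = {
            label: sum(tok in kws for tok in tokens)
            for label, kws in self.keywords.items()
        }
        total = sum(scores.values())
        if total == 0:
            return None, 0.0
        label = max(scores, key=scores.get)
        return label, scores[label] / total

    def distill(self, accepted):
        """Absorb accepted pseudo-labels (stand-in for a fine-tuning step)."""
        for text, label in accepted:
            self.keywords[label].update(text.lower().split())


def self_distill(annotator, unlabeled, rounds=3, threshold=0.9,
                 step=0.1, floor=0.6):
    """Iteratively annotate, screen by confidence, and retrain.

    Assumption: verification is a confidence threshold that is lowered
    by `step` each round, but never below `floor`.
    """
    pool = list(unlabeled)
    corpus = []
    for _ in range(rounds):
        accepted, remaining = [], []
        for text in pool:
            label, conf = annotator.annotate(text)
            if label is not None and conf >= threshold:
                accepted.append((text, label))
            else:
                remaining.append(text)
        annotator.distill(accepted)   # learn from screened annotations
        corpus.extend(accepted)
        pool = remaining              # retry rejected samples next round
        threshold = max(floor, threshold - step)
    return corpus, pool
```

In this sketch, a sample that the initial annotator cannot label may become labelable in a later round, once tokens from accepted samples have been absorbed. That mirrors, in a very simplified way, the idea of improving domain adaptability while generating and screening data.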
Fu et al. (Thu,) studied this question.