What does this research mean for the field?

Fine-tuning large language models on the InsQABench benchmark significantly improves their performance on Chinese insurance question answering tasks. Novelty: novel finding. Consensus alignment: establishes a new direction.

What question did this study set out to answer?

The study aims to establish a benchmark for evaluating large language models in the Chinese insurance question answering domain.

February 22, 2026 · Open Access

InsQABench: Benchmarking Chinese insurance domain question answering with large language models

Key Points

Developed InsQABench with 95K QA pairs from real insurance documents
Defined three specialized QA tasks relevant to insurance
Proposed SQL-ReAct and RAG-ReAct frameworks to improve LLM performance
The two proposed frameworks improved accuracy by 4.91% and 5.11% over the next best-performing model
Evaluated mainstream LLMs under both fine-tuned and zero-shot settings
Enhanced specific task performance using the proposed frameworks

Abstract

• Introduces InsQABench, the first benchmark for Chinese insurance QA with LLMs.
• Defines three specialized QA tasks covering structured and unstructured knowledge.
• Proposes SQL-ReAct and RAG-ReAct to enhance LLM performance in insurance tasks.

We present InsQABench, the first comprehensive benchmark for evaluating LLMs’ capabilities in Chinese insurance QA. InsQABench comprises 95K carefully curated QA pairs derived from real-world insurance documents, covering 3 distinct tasks, 44 question types, and 55 specialized insurance topics. Our experiments evaluated and reported the performance of mainstream LLMs under both fine-tuned and zero-shot settings, demonstrating that fine-tuning on InsQABench can significantly improve model performance. We also introduced two frameworks that further enhanced task-specific performance, achieving 4.91% and 5.11% enhancement in accuracy over the next best-performing model.
