• Introduces InsQABench, the first benchmark for Chinese insurance QA with LLMs.
• Defines three specialized QA tasks covering structured and unstructured knowledge.
• Proposes SQL-ReAct and RAG-ReAct to enhance LLM performance in insurance tasks.

We present InsQABench, the first comprehensive benchmark for evaluating LLMs’ capabilities in Chinese insurance QA. InsQABench comprises 95K carefully curated QA pairs derived from real-world insurance documents, covering 3 distinct tasks, 44 question types, and 55 specialized insurance topics. We evaluate mainstream LLMs under both fine-tuned and zero-shot settings and show that fine-tuning on InsQABench can significantly improve model performance. We also introduce two frameworks that further enhance task-specific performance, achieving accuracy improvements of 4.91% and 5.11% over the next best-performing model.
Wei et al. (Thu,) studied this question.