What question did this study set out to answer?

The aim is to evaluate and improve methods for detecting hallucinations in governance-related LLM applications.

May 22, 2026 | Open Access

A Multi-Metric Evaluation Perspective on Hallucination Detection in Low-Resource Governance Documents

Key Points

Reviewed recent studies on hallucination detection and evaluation frameworks.
Identified challenges in multilingual processing and harm-oriented risk assessment.
Argued for an integrated evaluation framework to assess the reliability and safety of LLMs.
Highlighted significant gaps in current methodologies for detecting hallucinations.
Emphasized the need for comprehensive frameworks rather than isolated assessments.
Noted that existing benchmarks often measure task accuracy rather than factual reliability or societal impact.

Abstract

The rapid advancement of Large Language Models (LLMs) has significantly improved natural language processing applications in domains such as governance, healthcare, legal analysis, and public information systems. Despite these advances, LLMs frequently hallucinate: they produce responses that appear plausible but contain incorrect or fabricated information. This poses serious risks in governance-related applications, where inaccurate information can influence policy interpretation, administrative decision-making, and public trust. Existing studies have proposed several approaches to address hallucinations, including semantic entropy–based detection, benchmark evaluation frameworks, and adversarial testing methods. However, the literature indicates that current solutions remain fragmented and often focus on isolated aspects such as model performance, dataset construction, or benchmark capability rather than on comprehensive reliability assessment. This literature review examines recent research on hallucination detection, multilingual and low-resource natural language processing, and evaluation frameworks for LLM reliability. The reviewed studies highlight key challenges, including the lack of multilingual hallucination evaluation, insufficient harm-oriented risk assessment, and limited adversarial robustness testing in governance contexts. Furthermore, existing benchmarks often measure task accuracy rather than factual reliability or societal impact. Based on this analysis, the review identifies major methodological and contextual gaps and argues for an integrated evaluation framework that combines meaning-level hallucination detection, harm-aware risk modeling, and multilingual robustness assessment. Such an approach could improve the reliability and safety of LLM systems deployed in governance and public service environments.
