What question did this study set out to answer?

The research aims to evaluate user trust and the reliability of LLMs in information retrieval systems.

May 7, 2026 (Open Access)

Evaluating Hallucination and Trust in LLM-Enhanced Information Retrieval Systems

Key Points

Investigates hallucination and trust in LLM-augmented information retrieval.
Evaluates performance using fact-checking benchmarks and datasets.
Employs semantic similarity metrics to assess alignment with ground truth.
Demonstrates the impact of hallucination rates on user trust.
Highlights improvements in accessibility with LLM integration.
Establishes metrics for evaluating transparency and reliability.

Abstract

The integration of Large Language Models (LLMs) into information retrieval systems has transformed the user experience by providing direct, conversational responses instead of traditional ranked lists of search results. This shift improves accessibility and convenience, but it also raises substantial concerns about user trust, behaviour, and the risk of misinformation. This thesis investigates the impact of generative information retrieval on the reliability of synthesized answers, focusing on how hallucination rates and semantic drift influence trust dynamics and information-seeking behaviour. By evaluating the performance of different LLMs on fact-checking benchmarks, the study seeks to quantify the trade-off between the advantages of model scaling and the inherent risks of factual inaccuracy. Hallucination and user trust in LLM-augmented information retrieval systems are evaluated using three fact-checking datasets. Three well-known semantic similarity metrics are employed to assess the alignment between LLM responses and ground-truth references. Hallucination rate and factual consistency are assessed by aligning model-generated responses with the verified annotations in those datasets, and bias detection measures are used to evaluate implicit stereotype reinforcement in LLM outputs. The study applies a comprehensive framework for evaluating and auditing hallucinations that combines quantitative performance metrics with user-level reliability insights. The work aims to establish a baseline for the transparency and reliability of LLMs in search and retrieval contexts.
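The two quantitative measurements named in the abstract, similarity to a ground-truth reference and hallucination rate against verified annotations, can be illustrated with a minimal sketch. The abstract does not name the three similarity metrics or give a formula for hallucination rate, so everything here is an assumption made for illustration: bag-of-words cosine similarity stands in for a semantic similarity metric, hallucination is treated as a mismatch between the model's verdict and the annotated label, and the function names `cosine_similarity` and `hallucination_rate` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(response, reference):
    """Bag-of-words cosine similarity between a response and a reference.

    Assumption: a crude lexical stand-in for a semantic similarity metric;
    the study's actual metrics are not named in the abstract.
    """
    a, b = Counter(tokenize(response)), Counter(tokenize(reference))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one side has no tokens, so similarity is undefined; report 0
    return dot / (norm_a * norm_b)


def hallucination_rate(predicted_labels, gold_labels):
    """Fraction of responses whose verdict disagrees with the gold annotation.

    Assumption: a response counts as hallucinated when its label differs
    from the verified label in the fact-checking dataset.
    """
    if len(predicted_labels) != len(gold_labels):
        raise ValueError("predicted and gold label lists must be the same length")
    if not gold_labels:
        return 0.0
    mismatches = sum(p != g for p, g in zip(predicted_labels, gold_labels))
    return mismatches / len(gold_labels)
```

For example, `hallucination_rate(["true", "false"], ["true", "true"])` counts one mismatch out of two responses. In practice an embedding-based similarity would replace the lexical one, since bag-of-words overlap misses paraphrases.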
