The integration of Large Language Models (LLMs) into information retrieval systems has transformed the user experience by providing direct, conversational responses instead of traditional ranked lists of search results. This shift improves accessibility and convenience, but it also raises substantial concerns about user trust, user behaviour, and the risk of misinformation. This thesis investigates how generative information retrieval affects the reliability of synthesized answers, focusing on how hallucination rates and semantic drift influence trust dynamics and information-seeking behaviour. By evaluating the performance of different LLMs on fact-checking benchmarks, the study seeks to quantify the trade-off between the advantages of model scaling and the inherent risks of factual inaccuracy. Hallucination and user trust in LLM-augmented information retrieval systems are evaluated on three fact-checking datasets. Three well-known semantic similarity metrics are used to assess the alignment between LLM responses and ground-truth references. Hallucination rate and factual consistency are assessed by comparing model-generated responses with the verified annotations in the fact-checking datasets, and bias detection measures are used to evaluate implicit stereotype reinforcement in LLM outputs. Together, these components form a comprehensive framework for evaluating and auditing hallucinations that combines quantitative performance metrics with user-level reliability insights. The work aims to establish a baseline for the transparency and reliability of LLMs in search and retrieval contexts.
Aswin Panthithara Suresh (Thu,) studied this question.