What question did this study set out to answer?

The study aimed to develop a retrieval-augmented generation (RAG) model integrated with LiverTox data to improve decision support in drug-induced liver injury (DILI).

February 14, 2026 | Open Access

Development of retrieval-augmented generation–based large language model for drug-induced liver injury using Livertox data

Key Points

The aim is to develop a retrieval-augmented generation model integrated with LiverTox data for improved decision support in drug-induced liver injury.
Processed 1343 LiverTox drug monographs into 8759 indexed segments using BioBERT embeddings.
Developed a retrieval-augmented generation pipeline with drug-specific prioritization and semantic search.
Evaluated twenty-five DILI questions across various RAG and non-RAG models using 5-point Likert scales.
Used blinded hepatologist evaluations for assessing model responses on accuracy, completeness, and conciseness.
GPT-4o (RAG) achieved the highest overall score of 4.47±0.10.
RAG-LLMs significantly outperformed non-RAG GPT-4o variants in accuracy (p<0.001) and completeness (p<0.01).
Moderate to large effect sizes in accuracy (d=0.778) and completeness (d=0.526) were noted for RAG outputs.
No hallucinations were reported in RAG-LLM outputs, while non-RAG variants had hallucinated responses.
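
The retrieval step summarized in the key points (semantic search combined with drug-specific prioritization and section-aware weighting) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the section names, the `SECTION_WEIGHTS` values, the `DRUG_MATCH_BOOST` constant, and the toy 3-dimensional vectors that stand in for BioBERT embeddings. The source does not report the actual weights or scoring formula.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    drug: str               # drug the monograph segment belongs to
    section: str            # monograph section label (hypothetical names)
    text: str               # segment text (placeholder strings below)
    embedding: List[float]  # stand-in for a BioBERT embedding


# Hypothetical section weights and drug-match boost; not values from the study.
SECTION_WEIGHTS = {"hepatotoxicity": 1.2, "mechanism": 1.1, "background": 1.0}
DRUG_MATCH_BOOST = 0.5


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query_vec: List[float], query_text: str, seg: Segment) -> float:
    """Semantic similarity, scaled by section weight, plus a drug-name boost."""
    s = cosine(query_vec, seg.embedding) * SECTION_WEIGHTS.get(seg.section, 1.0)
    if seg.drug.lower() in query_text.lower():
        s += DRUG_MATCH_BOOST  # prioritize segments about the drug named in the query
    return s


def retrieve(query_vec: List[float], query_text: str,
             segments: List[Segment], k: int = 3) -> List[Segment]:
    """Return the k highest-scoring segments for the query."""
    ranked = sorted(segments, key=lambda seg: score(query_vec, query_text, seg),
                    reverse=True)
    return ranked[:k]


# Toy index with placeholder text and made-up vectors.
segments = [
    Segment("amoxicillin", "hepatotoxicity", "placeholder text A", [0.9, 0.1, 0.0]),
    Segment("amoxicillin", "background", "placeholder text B", [0.2, 0.8, 0.0]),
    Segment("isoniazid", "hepatotoxicity", "placeholder text C", [0.95, 0.05, 0.0]),
]
query_text = "What pattern of liver injury is linked to isoniazid?"
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded query

for seg in retrieve(query_vec, query_text, segments, k=2):
    print(seg.drug, seg.section)
```

In a real pipeline the retrieved segments would be passed to the LLM as grounding context; the additive boost used here is only one of several plausible ways to prioritize drug-matched content.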

Abstract

Background: Idiosyncratic drug-induced liver injury (DILI) is a complex clinical challenge that requires timely and accurate decision support. LiverTox, curated by the National Institutes of Health (NIH), offers a comprehensive DILI evidence base, but its encyclopedia-like format hinders point-of-care use. Health care providers increasingly use general-purpose large language models (LLMs) in clinical care, which raises safety concerns about hallucinations and misinformation. We hypothesized that retrieval-augmented generation (RAG), which grounds LLM responses in LiverTox content, would enable accurate DILI decision support.

Methods: We processed 1343 LiverTox drug monographs into 8759 indexed segments using BioBERT embeddings. We developed a RAG pipeline that employs drug-specific prioritization, section-aware weighting, and semantic search to retrieve the most relevant content per query. Twenty-five DILI questions were evaluated across 6 models: 4 RAG-LLMs (Mistral-7B, Claude-3-Haiku, Claude-3-Opus, and GPT-4o) and 2 non-RAG GPT-4o variants (unconstrained, and soft-constrained with a prompt to reference LiverTox). Three hepatologists, blinded to the model, rated responses for accuracy, completeness, and conciseness on 5-point Likert scales. Analyses included pairwise comparisons and effect size estimation.

Results: One hundred fifty model responses (25 questions × 6 models) were evaluated with good inter-rater reliability. GPT-4o (RAG) achieved the highest overall score (4.47±0.10). RAG-LLMs outperformed the non-RAG GPT-4o variants in accuracy (p<0.001) and completeness (p<0.01). RAG was associated with moderate to large effect sizes for accuracy (d=0.778) and completeness (d=0.526). No hallucinations were observed in RAG-LLM outputs, while both non-RAG GPT-4o variants produced several hallucinated responses. Scores and hallucinated-response rates did not differ significantly between the 2 non-RAG variants.

Conclusions: We developed a RAG-LLM integrated with LiverTox for evidence-based DILI management. RAG-LLM systems outperformed non-RAG variants and produced responses without observed hallucinations in this evaluation. Our LiverTox RAG-LLM enables reliable answers to drug hepatotoxicity questions at the point of care.
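
For readers interpreting the reported effect sizes, the following is a minimal sketch of Cohen's d using the pooled standard deviation. The abstract does not state which estimator was used, so the pooled-SD form is an assumption, and the Likert ratings below are made-up illustrative numbers, not study data. By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), the reported values of 0.526 and 0.778 both sit between the medium and large thresholds.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Illustrative 5-point Likert ratings only; NOT data from the study.
rag_scores = [5, 4, 5, 4, 5]
non_rag_scores = [4, 3, 4, 3, 4]

print(round(cohens_d(rag_scores, non_rag_scores), 3))
```

A positive d here means the first group scored higher on average; swapping the arguments flips the sign.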
