What question did this study set out to answer?

This research proposes DILIConsult, a framework for using large language models to evaluate drug-induced liver injury in ICU settings.

March 15, 2026

DILIConsult: A Multi‐Agent Large Language Model Framework for Evaluating Drug‐Induced Liver Injury in ICU Settings

Key Points

Compared GPT-4-Turbo and GPT-4o for extracting DILI characteristics from LiverTox descriptions
Compared full-length case analysis with sequential drug-specific evaluations
Evaluated on suspected DILI cases identified in the MIMIC-IV ICU dataset using AASLD and EASL criteria
GPT-4o with a sequential approach improved extraction of DILI characteristics and analysis of suspected DILI
DILIConsult achieved the best mean rank (1.66 ± 0.75) in knowledge recall when compared against a clinician panel
DILIConsult ranked last for omission of important information (2.07 ± 0.52) and content inaccuracy (2.09 ± 0.72)

Abstract

Background Large language models (LLMs) can support clinical decision‐making by parsing databases and extracting relevant information. However, evaluating drug‐induced liver injury (DILI) often requires processing lengthy clinical histories alongside reference materials such as LiverTox, which together can exceed the context length of conventional LLMs. The resulting information truncation hinders standard approaches such as prompt engineering and retrieval‐augmented generation (RAG). To address these limitations, this study introduces DILIConsult, an agentic LLM pipeline based on GPT‐4 and designed to parse clinical and drug information.
Methods To develop DILIConsult, we compared GPT‐4‐Turbo with GPT‐4o for extracting DILI characteristics from LiverTox descriptions. We tested two approaches to analyzing cases of suspected DILI: full‐length case analysis and sequential drug‐specific evaluations. We evaluated DILIConsult on cases of suspected DILI identified from the open‐source Medical Information Mart for Intensive Care‐IV (MIMIC‐IV) ICU dataset based on American Association for the Study of Liver Diseases (AASLD) and European Association for the Study of the Liver (EASL) criteria. Outputs from DILIConsult were compared against those of a panel of clinicians comprising an ICU pharmacist, an ICU junior attending physician, and an ICU resident. Responses were evaluated by two senior ICU attending physicians.
Results Using GPT‐4o with a sequential approach improved both the extraction of DILI characteristics and the analysis of suspected DILI. DILIConsult achieved the best mean rank in knowledge recall (1.66 ± 0.75) and ranked second for reasoning (2.00 ± 0.64) and reflection of current medical consensus (2.05 ± 0.62). DILIConsult ranked last for omission of important information (2.07 ± 0.52) and content inaccuracy (2.09 ± 0.72).
Conclusion DILIConsult demonstrates the potential of LLMs to assist clinicians in evaluating DILI. The findings emphasize the importance of task division in LLM‐driven workflows to minimize information loss.
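
The contrast between full‐length case analysis and sequential drug‐specific evaluation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `ask_llm` callable, the prompt wording, and the `max_chars` limit (a crude character‐based stand‐in for a model's context length) are all assumptions made for illustration.

```python
from typing import Callable, Dict


def evaluate_full_length(case_text: str, drug_refs: Dict[str, str],
                         ask_llm: Callable[[str], str], max_chars: int) -> str:
    """One prompt holding the case and every drug reference.

    Anything beyond max_chars is cut off, mimicking information truncation.
    """
    refs = "\n\n".join(f"[{drug}]\n{ref}" for drug, ref in drug_refs.items())
    prompt = f"{case_text}\n\n{refs}"
    return ask_llm(prompt[:max_chars])


def evaluate_sequentially(case_text: str, drug_refs: Dict[str, str],
                          ask_llm: Callable[[str], str], max_chars: int) -> str:
    """One prompt per drug, followed by a final prompt that combines the findings."""
    findings = {}
    for drug, ref in drug_refs.items():
        # Each prompt carries only the reference text for a single drug.
        prompt = (f"Case:\n{case_text}\n\n"
                  f"Reference for {drug}:\n{ref}\n\n"
                  f"Question: could {drug} explain the liver injury?")
        findings[drug] = ask_llm(prompt[:max_chars])
    combined = "\n".join(f"{drug}: {text}" for drug, text in findings.items())
    summary_prompt = f"Combine these per-drug assessments:\n{combined}"
    return ask_llm(summary_prompt[:max_chars])
```

With n drugs, the sequential variant makes n + 1 model calls instead of one, trading extra calls for prompts that are each small enough to avoid truncation.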
