What question did this study set out to answer?

The study aimed to develop PKGPT, an automated system for NONMEM-based population pharmacokinetic (PopPK) modeling, and to benchmark its performance against human expert models.

April 22, 2026 · Open Access

PKGPT: Expert-Orchestrated Recursive LLM Agent for Automated NONMEM PopPK Modeling with Human Benchmarking

Key Points

Developed PKGPT powered by Google’s Gemini 3.0 Flash for PopPK modeling.
Executed model development in five sequential phases: base model establishment, structural diagnostics, overfitting reduction, random-effects optimization, and covariate analysis.
Evaluated against three datasets (warfarin, theophylline, tobramycin) and human expert models.
PKGPT consistently produced executable, converging NONMEM models across all three datasets.
In warfarin, the human expert achieved a lower objective function value (OFV) than PKGPT (294.41 vs. 484.43).
In theophylline, PKGPT's estimates were close to the expert's (Ka = 1.59 vs. 1.46 h−1; CL = 0.0399 vs. 0.0404 L/h/kg); in tobramycin, it identified the correct two-compartment structure but produced a physiologically implausible peripheral volume estimate (V2 = 149 L vs. the expert's 13.2 L).
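
The phased workflow in the key points can be pictured as a nested loop: an outer pass over the five phases and an inner retry loop that proposes a control stream, runs it, and feeds the result back. The Python sketch below is illustrative only. `RunResult`, `run_phases`, `run_model`, `propose_revision`, and `max_attempts` are hypothetical names, and the acceptance rule (keep a candidate only if its run converged) is an assumption rather than PKGPT's documented logic. The NONMEM call and the LLM call are passed in as plain callables, so nothing here depends on a real API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# The five phases named in the study summary, in order.
PHASES = [
    "base model establishment",
    "structural diagnostics",
    "overfitting reduction",
    "random-effects optimization",
    "covariate analysis",
]


@dataclass
class RunResult:
    """Hypothetical container for what a parsed run output might hold."""
    converged: bool
    ofv: Optional[float]  # objective function value; None if the run failed
    log: str              # raw output text handed back to the next proposal


def run_phases(
    initial_stream: str,
    run_model: Callable[[str], RunResult],
    propose_revision: Callable[[str, str, Optional[RunResult]], str],
    max_attempts: int = 3,
) -> Tuple[str, List[Tuple[str, int, bool]]]:
    """Walk the phases, retrying each one up to max_attempts times.

    Assumption: a candidate control stream is accepted only if its run
    converged; otherwise the last accepted stream is kept and the failed
    result is passed to the next proposal.
    """
    stream = initial_stream
    last: Optional[RunResult] = None
    history: List[Tuple[str, int, bool]] = []
    for phase in PHASES:
        for attempt in range(1, max_attempts + 1):
            candidate = propose_revision(phase, stream, last)
            last = run_model(candidate)
            history.append((phase, attempt, last.converged))
            if last.converged:
                stream = candidate
                break
    return stream, history
```

Injecting both callables keeps the loop testable with stand-ins. A real system would replace them with an actual NONMEM invocation, an output parser, and phase-specific prompts, none of which are specified in this summary.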

Abstract

Background/Objectives: Population pharmacokinetic (PopPK) modeling in NONMEM requires iterative, expertise-dependent workflows. Naïve zero-shot prompting of general-purpose large language models (LLMs) typically produces NONMEM code that fails to execute. This study introduces PKGPT, a recursive agentic LLM system designed to automate NONMEM-based PopPK model development, and benchmarks its performance against human expert models.

Methods: PKGPT, powered by Google’s Gemini 3.0 Flash, embeds pharmacometrics expertise into phase-specific expert-agent prompts orchestrated across five sequential phases: base model establishment, structural diagnostics, overfitting reduction, random-effects optimization, and covariate analysis. The system recursively executes NONMEM, parses outputs, and iteratively refines control streams. PKGPT was evaluated on three public datasets (warfarin, theophylline, and tobramycin) and benchmarked against independently developed human expert models.

Results: PKGPT consistently produced executable, converging NONMEM models across all three datasets. In warfarin, both PKGPT and the human expert selected a one-compartment oral structure (ADVAN2), but the expert achieved a lower OFV (294.41 vs. 484.43) via covariate scaling. In theophylline, PKGPT produced parameter estimates close to the expert solution (Ka = 1.59 vs. 1.46 h−1; CL = 0.0399 vs. 0.0404 L/h/kg). In tobramycin, PKGPT correctly identified a two-compartment structure but produced physiologically implausible peripheral volume estimates (V2 = 149 L vs. expert’s 13.2 L). Across datasets, PKGPT did not identify clinically established covariates, and run-to-run reproducibility was variable.

Conclusions: PKGPT substantially improves the robustness and usability of LLM-generated NONMEM code compared with naïve zero-shot prompting, accelerating model drafting and iterative refinement, but physiological plausibility and clinical interpretability still require human-in-the-loop oversight.
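
To put the reported numbers on a common scale, the short Python sketch below computes the relative difference and fold ratio of each PKGPT estimate against the expert value, plus the warfarin OFV gap. It uses only the figures quoted in the abstract. The helper name `relative_difference` and the dictionary layout are illustrative choices, not part of the study.

```python
def relative_difference(value: float, reference: float) -> float:
    """Signed difference of value from reference, as a fraction of reference."""
    return (value - reference) / reference


# (PKGPT estimate, expert estimate) pairs as quoted in the abstract.
reported = {
    "theophylline Ka (1/h)": (1.59, 1.46),
    "theophylline CL (L/h/kg)": (0.0399, 0.0404),
    "tobramycin V2 (L)": (149.0, 13.2),
}

for name, (pkgpt, expert) in reported.items():
    rel = relative_difference(pkgpt, expert)
    fold = pkgpt / expert
    print(f"{name}: {rel:+.1%} vs. expert ({fold:.2f}-fold)")

# Warfarin objective function values: PKGPT minus expert.
delta_ofv = 484.43 - 294.41
print(f"warfarin OFV gap: {delta_ofv:.2f}")
```

On these figures, the theophylline estimates differ from the expert's by roughly +8.9% (Ka) and −1.2% (CL), the tobramycin V2 is about 11.3 times the expert value, and the warfarin OFV gap is 190.02. The gap is plain subtraction. Whether it supports a formal likelihood-ratio comparison depends on whether the two models are nested and fit to identical data, which the abstract does not state.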

Cite This Study

Kwack et al. studied this question.

synapsesocial.com/papers/69e867136e0dea528ddeb62e https://doi.org/10.3390/pharmaceutics18040501
