Background/Objectives: Population pharmacokinetic (PopPK) modeling in NONMEM requires iterative, expertise-dependent workflows. Naïve zero-shot prompting of general-purpose large language models (LLMs) typically produces NONMEM code that fails to execute. This study introduces PKGPT, a recursive agentic LLM system designed to automate NONMEM-based PopPK model development, and benchmarks its performance against human expert models. Methods: PKGPT, powered by Google’s Gemini 3.0 Flash, embeds pharmacometrics expertise into phase-specific expert-agent prompts orchestrated across five sequential phases: base model establishment, structural diagnostics, overfitting reduction, random-effects optimization, and covariate analysis. The system recursively executes NONMEM, parses outputs, and iteratively refines control streams. PKGPT was evaluated on three public datasets (warfarin, theophylline, and tobramycin) and benchmarked against independently developed human expert models. Results: PKGPT consistently produced executable, converging NONMEM models across all three datasets. In warfarin, both PKGPT and the human expert selected a one-compartment oral structure (ADVAN2), but the expert achieved a lower OFV (294.41 vs. 484.43) via covariate scaling. In theophylline, PKGPT produced parameter estimates close to the expert solution (Ka = 1.59 vs. 1.46 h−1; CL = 0.0399 vs. 0.0404 L/h/kg). In tobramycin, PKGPT correctly identified a two-compartment structure but produced a physiologically implausible peripheral volume estimate (V2 = 149 L vs. the expert’s 13.2 L). Across datasets, PKGPT did not identify clinically established covariates, and run-to-run reproducibility was variable. Conclusions: PKGPT substantially improves the robustness and usability of LLM-generated NONMEM code compared with naïve zero-shot prompting, accelerating model drafting and iterative refinement, but physiological plausibility and clinical interpretability still require human-in-the-loop oversight.
Kwack et al. studied this question.
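The recursive loop summarized in the Methods (execute NONMEM, parse outputs, refine the control stream, phase by phase) can be illustrated with a minimal Python sketch. This is not PKGPT's implementation: `RunResult`, `refine_through_phases`, `run_model`, `propose_revision`, `accept`, and `max_iters_per_phase` are hypothetical names introduced here, the NONMEM execution and the LLM agent are passed in as plain callables, and the default acceptance rule (converged, with an OFV no worse than the current best) is an assumed placeholder that a phase such as overfitting reduction would likely need to replace.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# The five sequential phases named in the abstract.
PHASES = [
    "base model establishment",
    "structural diagnostics",
    "overfitting reduction",
    "random-effects optimization",
    "covariate analysis",
]


@dataclass
class RunResult:
    """Hypothetical summary of one parsed NONMEM run."""
    converged: bool
    ofv: float
    messages: List[str] = field(default_factory=list)


def default_accept(best: RunResult, new: RunResult) -> bool:
    # Assumed placeholder rule: keep a candidate only if it converged and
    # its OFV is not worse than the current best (or the best never converged).
    return new.converged and (not best.converged or new.ofv <= best.ofv)


def refine_through_phases(
    control_stream: str,
    run_model: Callable[[str], RunResult],
    propose_revision: Callable[[str, str, RunResult], str],
    max_iters_per_phase: int = 3,
    accept: Optional[Callable[[RunResult, RunResult], bool]] = None,
) -> Tuple[str, RunResult, List[Tuple[str, RunResult]]]:
    """Run each phase in order, revising the control stream up to a fixed budget."""
    accept = accept or default_accept
    best_stream = control_stream
    best = run_model(best_stream)  # execute and parse the starting model
    history = [("initial", best)]

    for phase in PHASES:
        for _ in range(max_iters_per_phase):
            # The phase-specific agent sees the current stream and its parsed result.
            candidate = propose_revision(phase, best_stream, best)
            if candidate == best_stream:
                break  # the agent proposed no change; move to the next phase
            result = run_model(candidate)
            history.append((phase, result))
            if accept(best, result):
                best_stream, best = candidate, result

    return best_stream, best, history
```

Passing the runner and the agent as callables keeps the loop testable with stubs, and the fixed per-phase budget bounds the number of NONMEM executions; the returned `history` gives a reviewer a record of every attempted run, which is where human oversight of plausibility would plug in.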