What question did this study set out to answer?

The aim is to improve the reliability and trustworthiness of large language models in healthcare analytics.

June 3, 2026 | Open Access

Constrained-Execution Agentic Analytics for Regulated Data: An Architecture Pattern for Trustworthy LLM Cohort Analytics, with a Reproducibility Case Study (Deterministic Dispatch: 33% → 100%)

Key Points

Developed Claim-FM, a prototype using constrained execution with typed primitives.
Utilized a crew of four agents (Planner, Analyst, Critic, Responder) for distinct roles.
Evaluated the prototype on a synthetic demonstrator of about 2,300 patients.
Deterministic dispatch improved plan reliability from 33% to 100% at no extra cost.
An adversarial critic, run only on high-stakes intents (about 30% of queries), bounded the trust-versus-latency cost.
Fine-tuning the Planner (Qwen3-4B with LoRA) on the audit log raised its score on a 195-case planner benchmark from 1.8 to 91.6.
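
The constrained-execution and deterministic-dispatch ideas in the points above can be illustrated with a minimal Python sketch. This is not the Claim-FM implementation: the primitive names (`filter_dx`, `count_patients`, `mean_cost`), the `Step` type, the registry layout, and the three sample rows are all hypothetical. The sketch assumes a planner model has already emitted a plan as a list of named steps; the runtime validates the whole plan against a typed registry and then runs it along a fixed path, so every reported number comes from a primitive and never from a model.

```python
from dataclasses import dataclass

# Hypothetical synthetic claims rows; field names are illustrative only.
CLAIMS = [
    {"patient_id": 1, "dx": "E11", "cost": 120.0},
    {"patient_id": 2, "dx": "I10", "cost": 80.0},
    {"patient_id": 3, "dx": "E11", "cost": 200.0},
]


@dataclass(frozen=True)
class Step:
    """One planned operation: a primitive name plus its arguments."""
    primitive: str
    args: dict


def filter_dx(rows, dx):
    return [r for r in rows if r["dx"] == dx]


def count_patients(rows):
    return len({r["patient_id"] for r in rows})


def mean_cost(rows):
    if not rows:
        raise ValueError("mean_cost needs at least one row")
    return sum(r["cost"] for r in rows) / len(rows)


# Registry of allowed primitives: name -> (function, argument schema).
PRIMITIVES = {
    "filter_dx": (filter_dx, {"dx": str}),
    "count_patients": (count_patients, {}),
    "mean_cost": (mean_cost, {}),
}


def validate(step):
    """Reject any step that is not a registered primitive with typed args."""
    if step.primitive not in PRIMITIVES:
        raise ValueError(f"unknown primitive: {step.primitive}")
    _, schema = PRIMITIVES[step.primitive]
    if set(step.args) != set(schema):
        raise TypeError(f"bad argument names for {step.primitive}")
    for name, expected in schema.items():
        if not isinstance(step.args[name], expected):
            raise TypeError(f"{step.primitive}.{name} must be {expected.__name__}")


def run_plan(plan, rows):
    """Fixed runtime path: validate the whole plan, then run steps in order."""
    for step in plan:
        validate(step)
    value = rows
    for step in plan:
        fn, _ = PRIMITIVES[step.primitive]
        value = fn(value, **step.args)
    return value


plan = [Step("filter_dx", {"dx": "E11"}), Step("mean_cost", {})]
print(run_plan(plan, CLAIMS))
```

In this sketch a plan that names anything outside the registry (for example a free-form SQL or code step) fails validation before any data is touched, which is one plausible reading of how a runtime can own the shape of the pipeline.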

Abstract

Large language model agents are increasingly used for analytics over regulated healthcare claims data, yet the common designs — a single tool-using "super agent", or a free-form crew of agents that writes and runs code — are hard to reproduce, audit, and trust. We present Claim-FM, a working prototype for healthcare claims analytics in which language models are used for planning and interpretation while every data operation is executed through a constrained set of typed, deterministic primitives: the models reason, and the primitives compute. A crew of four agents (Planner, Analyst, Critic, Responder) is driven by a runtime that owns the shape of the pipeline; the agents do not call each other, and no model ever produces a reported number — a model decides what to run and explains the result, while typed primitives compute every value. We report four empirical results. (1) Deterministic dispatch — replacing free-form hand-off with a fixed runtime path — raises multi-step plan reliability from 33% to 100% on a synthetic benchmark at no additional cost; the effect is at the architecture level, not the model level. (2) A critic that runs adversarially only on high-stakes intents (about 30% of queries) bounds the trust-versus-latency cost. (3) A training-shaped audit log makes every run replayable and directly usable as fine-tuning data; fine-tuning the Planner (Qwen3-4B with LoRA) on this audit-shaped corpus moves a 195-case planner benchmark from 1.8 to 91.6. (4) A transferable deployment failure: the fine-tuned model silently lost accuracy in production when it was served a prompt that differed structurally from the one used to score it, with symptoms that imitated a hardware out-of-memory error and a flaky model; a replay audit over 70 conversations and 136 turns confirmed that the fix held within a fixed configuration (0 OOM; 91.2% raw success). 
The contributions are the constrained-execution pattern, described openly enough to reproduce; the measured architecture-level reliability result; the training-shaped audit method and the fine-tune it produced; and the prompt-contract failure mode together with the replay method used to diagnose it. Scope is stated plainly: the evaluation uses a synthetic demonstrator of about 2,300 patients, and the separate claims foundation model (a masked-event embeddings transformer) is left untrained. Concurrent commercial systems are cited as prior art, not as head-to-head comparisons. This Zenodo record is the canonical, citable version of record for this work.
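
Results (2) to (4) in the abstract describe a selectively adversarial critic, a training-shaped audit log, and a prompt-contract failure. The Python sketch below shows one way these pieces could fit together; it is an illustration under stated assumptions, not the method used in the paper. The intent names, the record fields, the `non_adversarial` mode label, and the idea of fingerprinting the prompt template with SHA-256 are all hypothetical choices made for this example.

```python
import hashlib
import json

# Hypothetical set of intents treated as high-stakes.
HIGH_STAKES_INTENTS = {"cost_estimate", "cohort_comparison"}


def critic_mode(intent):
    """Run the critic adversarially only on high-stakes intents."""
    return "adversarial" if intent in HIGH_STAKES_INTENTS else "non_adversarial"


def prompt_fingerprint(template):
    """Short, stable hash of the exact prompt template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


def audit_record(question, intent, plan, result, template):
    """One audit entry shaped like a training example: input -> target plan."""
    return {
        "prompt_fingerprint": prompt_fingerprint(template),
        "input": question,
        "intent": intent,
        "target_plan": plan,
        "result": result,
        "critic_mode": critic_mode(intent),
    }


def check_prompt_contract(record, serving_template):
    """Fail loudly if the serving prompt differs from the logged one."""
    served = prompt_fingerprint(serving_template)
    if record["prompt_fingerprint"] != served:
        raise RuntimeError(
            f"prompt contract mismatch: logged {record['prompt_fingerprint']}, "
            f"serving {served}"
        )


# Hypothetical template and query, for illustration only.
TEMPLATE = "SYSTEM: You are the Planner.\nUSER: {question}\nPLAN:"
record = audit_record(
    question="Mean cost for patients with dx E11?",
    intent="cost_estimate",
    plan=[{"primitive": "filter_dx", "args": {"dx": "E11"}},
          {"primitive": "mean_cost", "args": {}}],
    result=160.0,
    template=TEMPLATE,
)

# One JSON line per run: replayable, and usable as a fine-tuning row.
line = json.dumps(record, sort_keys=True)
check_prompt_contract(json.loads(line), TEMPLATE)
print(line)
```

Storing a fingerprint of the prompt next to each logged run means a structurally different serving prompt is caught by an explicit check at load time, instead of surfacing later as an unexplained accuracy drop.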

Cite This Study

Subia Fatima studied this question.

synapsesocial.com/papers/6a1fc616dee9eb8c0dce74c3 https://doi.org/10.5281/zenodo.20489424