Large language model agents are increasingly used for analytics over regulated healthcare claims data, yet the common designs — a single tool-using "super agent", or a free-form crew of agents that writes and runs code — are hard to reproduce, audit, and trust. We present Claim-FM, a working prototype for healthcare claims analytics in which language models are used for planning and interpretation while every data operation is executed through a constrained set of typed, deterministic primitives: the models reason, and the primitives compute. A crew of four agents (Planner, Analyst, Critic, Responder) is driven by a runtime that owns the shape of the pipeline; the agents do not call each other, and no model ever produces a reported number — a model decides what to run and explains the result, while typed primitives compute every value. We report four empirical results. (1) Deterministic dispatch — replacing free-form hand-off with a fixed runtime path — raises multi-step plan reliability from 33% to 100% on a synthetic benchmark at no additional cost; the effect is at the architecture level, not the model level. (2) A critic that runs adversarially only on high-stakes intents (about 30% of queries) bounds the trust-versus-latency cost. (3) A training-shaped audit log makes every run replayable and directly usable as fine-tuning data; fine-tuning the Planner (Qwen3-4B with LoRA) on this audit-shaped corpus moves a 195-case planner benchmark from 1.8 to 91.6. (4) A transferable deployment failure: the fine-tuned model silently lost accuracy in production when it was served a prompt that differed structurally from the one used to score it, with symptoms that imitated a hardware out-of-memory error and a flaky model; a replay audit over 70 conversations and 136 turns confirmed that the fix held within a fixed configuration (0 OOM; 91.2% raw success). The contributions are the constrained-execution pattern, described openly enough to reproduce; the measured architecture-level reliability result; the training-shaped audit method and the fine-tune it produced; and the prompt-contract failure mode together with the replay method used to diagnose it. Scope is stated plainly: the evaluation uses a synthetic demonstrator of about 2,300 patients, and the separate claims foundation model (a masked-event embeddings transformer) is left untrained. Concurrent commercial systems are cited as prior art, not as head-to-head comparisons. This Zenodo record is the canonical, citable version of record for this work.
Subia Fatima (Mon,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: