This paper presents empirical findings from a 75-day longitudinal case study of a cross-vendor governed AI agent pipeline spanning Anthropic (reasoning layer) and OpenAI (execution layer) operating across 168 conversations, 10,501 messages, and an estimated 1.19 billion tokens of total consumption. The central finding is that the orchestration infrastructure surrounding an AI model — the harness — drives more cost, performance variation, and governance enforcement difficulty than the model itself. Six research questions structure the investigation. First, a full-corpus token decomposition reveals that user instruction content constitutes 0.05% of total estimated token consumption in this corpus. In this corpus, the remaining 99.95% approximately decomposes into vendor-controlled system prompt overhead (13.3%), platform-structural conversation replay (84.4%), and model output plus external data (2.2%). This ratio — one user token for every 1,875 non-user tokens — is structural rather than scale-dependent, persisting across workloads differing by 5× in total volume and growing quadratically within individual conversations. Second, three independent academic studies and a vendor-disclosed postmortem confirm that the orchestration harness contributes more to execution quality variation than the foundation model, including undisclosed system prompt modifications that caused a measured 3% intelligence drop. Third, self-built token observability infrastructure exhibited a 74% aggregation discrepancy between two views of the same data in the same database, demonstrating that the cost of not knowing the cost is itself a form of hidden tax. The paper’s most novel contribution concerns governance enforcement. Analysis of 6,349 reasoning-trace blocks across the full corpus reveals that an estimated 16.7% of conversations required operator governance corrections (76 validated events from 118 detected, after a 65% true-positive validation rate) — a rate that does not decrease over the 75-day observation window despite accumulating governance documentation. After controlling for conversation complexity, conversations with high governance awareness in the model’s reasoning traces exhibited a 69% correction rate compared to 12% for low-awareness conversations within the same complexity band — a 5.8× ratio. This awareness paradox — where the model reasons about governance rules more frequently while failing to follow them at equal or higher rates — directly challenges the prevailing assumption that placing governance rules in the model’s context produces governance compliance. The finding connects to proven impossibility results in learning theory and provides empirical evidence that mechanical enforcement of governance rules is architecturally necessary, not merely operationally convenient.
Damyen Lofton (Fri,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: