Does a long-running LLM agent become something different from the agent that was deployed? After two years of building agent systems, that was the question I kept coming back to, and this paper is an attempt to answer it empirically. It proposes a nine-point specification that distinguishes an agent from a workflow, then compares the two head-to-head: same model, same system prompt, same tools, one with persistent memory and one without. Across two paired customer-support experiments, first on synthetic tickets and then on Bitext real-world tickets, with three paired sessions each, the memory-equipped agent was 3.7 to 4.9 times more efficient on tool calls per ticket, and the gap widened over the course of the session in both runs. On pairwise judging of the same probe responses, however, the stateless control won 74.5% to 80.7% of decisive comparisons, a result that absolute scoring missed entirely. The mechanism turned out to be specific enough to name: “New Guy Syndrome,” the tendency of memoryless agents to stay in an apologetic, over-accommodating, customer-friendly register, the way a new hire compensates for missing context with extra politeness and hedging. The register difference shows up clearly in the judge’s preferences on tone and customer-friendliness and is invisible to static benchmarks, every entry of which is itself a new-guy interaction. The conclusion is that the deployment question for long-running agents is not whether to stop drift, but which axes of drift to keep, which to contain, and how to tell the difference.
Prakash Mel Krishnan studied this question.
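The two headline metrics in the abstract, tool calls per ticket and the share of decisive pairwise comparisons won, can be made concrete with a small sketch. This is not the paper's evaluation code. It assumes that "efficiency" means the ratio of the stateless control's tool calls per ticket to the memory agent's, and that a "decisive" comparison is any judge verdict other than a tie. The names `SessionStats`, `efficiency_ratio`, and `decisive_win_rate` are hypothetical, and the numbers in the usage section are made up for illustration, not taken from the experiments.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical verdict labels; the paper's judge format is not given in the abstract.
MEMORY, STATELESS, TIE = "memory", "stateless", "tie"


@dataclass(frozen=True)
class SessionStats:
    """Aggregate counts for one arm of a paired session (illustrative structure)."""
    tickets: int
    tool_calls: int

    @property
    def calls_per_ticket(self) -> float:
        if self.tickets <= 0:
            raise ValueError("tickets must be positive")
        return self.tool_calls / self.tickets


def efficiency_ratio(memory: SessionStats, stateless: SessionStats) -> float:
    """How many times fewer tool calls per ticket the memory arm used.

    Assumption: efficiency is the ratio of calls-per-ticket between arms.
    """
    if memory.tool_calls <= 0:
        raise ValueError("memory arm must have made at least one tool call")
    return stateless.calls_per_ticket / memory.calls_per_ticket


def decisive_win_rate(verdicts: Iterable[str], side: str) -> float:
    """Share of non-tie verdicts won by `side`.

    Assumption: ties are the only non-decisive outcome and are dropped
    from the denominator.
    """
    decisive = [v for v in verdicts if v != TIE]
    if not decisive:
        raise ValueError("no decisive comparisons")
    return sum(v == side for v in decisive) / len(decisive)


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    memory_arm = SessionStats(tickets=50, tool_calls=100)
    stateless_arm = SessionStats(tickets=50, tool_calls=400)
    print(f"efficiency ratio: {efficiency_ratio(memory_arm, stateless_arm):.1f}x")

    verdicts = [STATELESS] * 6 + [MEMORY] * 2 + [TIE] * 2
    print(f"stateless decisive win rate: {decisive_win_rate(verdicts, STATELESS):.1%}")
```

Dropping ties from the denominator matters for reading the result: a win rate computed over decisive comparisons alone says nothing about how often the judge saw no difference, so the tie count is worth reporting alongside it.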