May 24, 2026 | Open Access

Continuants: Toward a Theory of Behavioural Persistence in Long-Running LLM Agents

Key Points

The study investigates whether long-running LLM agents evolve into distinct entities compared to their initial deployment.
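It proposes a nine-point specification that distinguishes an agent from a workflow.
It runs two paired customer-support experiments, first on synthetic tickets and then on Bitext real-world tickets, with three paired sessions each.
Each experiment compares a memory-equipped agent with a stateless control that shares the same model, system prompt, and tools.
The memory-equipped agent was 3.7 to 4.9 times more efficient than the stateless control in tool calls per ticket, and the gap widened across the session in both runs.
In pairwise judging of the same probe responses, the stateless control won 74.5% to 80.7% of decisive comparisons, a result that absolute scoring missed.
The paper names the mechanism the "New Guy Syndrome": memoryless agents stay in an apologetic, over-accommodating register, much as a new hire compensates for missing context with extra politeness and hedging.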

Abstract

Does a long-running LLM agent become something different from the agent that was deployed? After two years building agent systems, that was the question I kept coming back to, and this paper is an attempt to answer it empirically. The paper proposes a nine-point specification that distinguishes an agent from a workflow, then compares the two head-to-head: same model, same system prompt, same tools, one with persistent memory, one without. Across two paired customer-support experiments, first on synthetic tickets and then on Bitext real-world tickets with three paired sessions each, the memory-equipped agent was 3.7 to 4.9 times more efficient on tool calls per ticket, with the gap widening across the session in both runs. On pairwise judging of the same probe responses, the stateless control won 74.5 to 80.7% of decisive comparisons, a result that absolute scoring missed entirely. The mechanism turned out to be specific enough to name: the “New Guy Syndrome,” the tendency of memoryless agents to stay in an apologetic, over-accommodating, customer-friendly register, the way a new hire compensates for missing context with extra politeness and hedging. The register difference shows up clearly in the judge’s preferences on tone and customer-friendliness and is invisible to static benchmarks, every entry of which is itself a new-guy interaction. The conclusion is that the deployment question for long-running agents is not whether to stop drift, but which axes of drift to keep, which to contain, and how to tell the difference.
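The two headline metrics in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: that "efficiency" means the ratio of mean tool calls per ticket (stateless over memory-equipped), and that a "decisive" comparison is any pairwise verdict that is not a tie. The function names, verdict labels, and numbers below are hypothetical toy values, not the paper's data or code.

```python
from collections import Counter


def tool_call_ratio(stateless_calls, memory_calls):
    """Ratio of mean tool calls per ticket: stateless control over memory agent.

    Assumption: the paper's "N times more efficient" refers to this ratio.
    """
    mean_stateless = sum(stateless_calls) / len(stateless_calls)
    mean_memory = sum(memory_calls) / len(memory_calls)
    return mean_stateless / mean_memory


def decisive_win_rate(verdicts, side):
    """Share of decisive pairwise verdicts won by `side`.

    Assumption: a verdict is "decisive" when it is not a tie, so ties are
    dropped from the denominator.
    """
    counts = Counter(verdicts)
    decisive = counts["memory"] + counts["stateless"]
    if decisive == 0:
        raise ValueError("no decisive comparisons")
    return counts[side] / decisive


# Toy data for illustration only (not taken from the paper).
stateless_calls = [8, 10, 12]   # tool calls per ticket, stateless control
memory_calls = [2, 3, 1]        # tool calls per ticket, memory-equipped agent
ratio = tool_call_ratio(stateless_calls, memory_calls)

# One judge verdict per probe pair: "memory", "stateless", or "tie".
verdicts = ["stateless", "stateless", "memory", "tie", "stateless", "tie"]
win_rate = decisive_win_rate(verdicts, "stateless")

print(f"tool-call ratio: {ratio:.1f}x")
print(f"stateless win rate on decisive comparisons: {win_rate:.1%}")
```

Dropping ties from the denominator is what lets a pairwise judge surface a preference that absolute scores can hide: two responses may receive the same absolute score and still produce a clear head-to-head winner.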
