March 27, 2026 · Open Access

Procedural Skill Memory for LLM Agent Systems: Architecture, Benchmark, and Honest Limits of a First Implementation

Key Points

The aim is to enhance AI agents' ability to retain and utilize procedural knowledge across tasks.
Implemented the OrKa Brain prototype for procedural skill memory in LLM agent systems.
Designed a benchmark with 30 tasks evaluating cross-domain and same-domain performance.
Used an LLM-as-judge evaluation protocol to assess agent performance.
Demonstrated a 63.3% win rate for the Brain-augmented condition in pairwise comparisons.
Found the strongest signal in perceived trustworthiness (19 out of 28 wins).
Identified ceiling effects with minimal overall score improvements (+0.10 on a 10-point scale).

Abstract

Current AI agent systems operate primarily as stateless executors: they do not retain procedural experience across tasks. I propose five desirable properties for experience-driven agent systems and present OrKa Brain, an open-source prototype that implements a procedural skill memory loop (learn, persist, retrieve, apply, feedback, decay) within a YAML-based LLM agent orchestration framework. I evaluate the system on a 30-task benchmark across two tracks (cross-domain transfer and same-domain accumulation) using an LLM-as-judge evaluation protocol. Results show a consistent but modest advantage for the Brain-augmented condition: 63.3% pairwise win rate, with the strongest signal in perceived trustworthiness (19/28 wins). Absolute rubric deltas remain small (+0.10 overall on a 10-point scale), revealing a ceiling effect: the underlying LLM already possesses the procedural knowledge the Brain recalls. The current implementation uses rule-based keyword extraction rather than semantic understanding, and the benchmark carries significant confounds (unequal pipeline lengths, single model, single run). I report both the positive signals and the negative ones, identify the bottlenecks, and outline the architectural slots designed for progressive upgrade.
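As an illustration of the loop named in the abstract (learn, persist, retrieve, apply, feedback, decay), the following is a minimal Python sketch. It is not OrKa Brain's code: every name here (`Skill`, `SkillMemory`, `apply_skill`, the confidence score, the `rate`, `factor`, and `floor` parameters) is a hypothetical stand-in, persistence is reduced to an in-memory dictionary, and the keyword rule is an assumed simplification of the rule-based keyword extraction the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical record: a named procedure, its trigger keywords, and a confidence score."""
    name: str
    steps: list
    keywords: set
    confidence: float = 0.5


@dataclass
class SkillMemory:
    """Hypothetical in-memory store; a real system would persist this to disk or a database."""
    skills: dict = field(default_factory=dict)

    def learn(self, name, steps, task_text):
        # Learn + persist: assumed keyword rule (lowercased words longer than 3 characters).
        keywords = {w for w in task_text.lower().split() if len(w) > 3}
        self.skills[name] = Skill(name, list(steps), keywords)

    def retrieve(self, task_text):
        # Retrieve: return the skill with the largest confidence-weighted keyword overlap,
        # or None when no stored skill shares a keyword with the task.
        words = set(task_text.lower().split())
        best, best_score = None, 0.0
        for skill in self.skills.values():
            score = len(skill.keywords & words) * skill.confidence
            if score > best_score:
                best, best_score = skill, score
        return best

    def feedback(self, name, success, rate=0.2):
        # Feedback: move confidence toward 1.0 on success and toward 0.0 on failure.
        skill = self.skills[name]
        target = 1.0 if success else 0.0
        skill.confidence += rate * (target - skill.confidence)

    def decay(self, factor=0.9, floor=0.05):
        # Decay: shrink every confidence and forget skills that fall below the floor.
        for name in list(self.skills):
            self.skills[name].confidence *= factor
            if self.skills[name].confidence < floor:
                del self.skills[name]


def apply_skill(skill, task_text):
    # Apply: prepend the recalled steps to the task prompt handed to the LLM.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(skill.steps, start=1))
    return f"Recalled procedure '{skill.name}':\n{numbered}\n\nTask: {task_text}"


memory = SkillMemory()
memory.learn(
    "validate_input",
    ["parse the records", "check them against the schema", "report errors"],
    "validate incoming customer records against schema",
)
hit = memory.retrieve("validate supplier records before import")
prompt = apply_skill(hit, "validate supplier records before import")
memory.feedback(hit.name, success=True)
memory.decay()
```

Because retrieval in this sketch matches surface keywords only, a task phrased with different vocabulary would miss the stored skill, which is the kind of limitation the abstract attributes to keyword extraction without semantic understanding.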

Authors

Marco Somma
