What question did this study set out to answer?

This research aims to develop a training paradigm that enhances the performance of computer use agents by addressing issues of intent blindness and representational mismatch.

May 16, 2026 | Open Access

Memory Archive: A Memory-Grounded Training Paradigm for Computer Use Agents

Key Points

Introduced Memory Archive as a structured training approach for computer use agents.
Implemented a four-stage training pipeline: Pre-Training, Supervised Fine-Tuning, Post-Training, and Inference-Time Retrieval.
Developed a dynamic retrieval system for relevant memories and a mechanism for self-generated evaluations during training.
Enhanced agent performance through the new training paradigm, leading to improved task execution.
Introduced in-training evaluation via self-generated memories, allowing overfitting, underfitting, and context-awareness to be detected without static external benchmarks.
Achieved significant alignment between training conditions and deployment performance based on new reward metrics.

Abstract

This paper introduces the Memory Archive training paradigm, an end-to-end data architecture and training pipeline that addresses the structural failures of standard Computer Use Agent (CUA) training. Currently, most CUA systems rely on behavioural cloning followed by outcome-supervised RL, leading to intent blindness and a severe representational mismatch between training and deployment formats. The central thesis of this paradigm is Format Consistency. The system centers around a compiled task guide called 'memory.md': a structured document containing step-by-step procedural reasoning, execution commands, and visual state references. This architecture threads this single artifact through four critical stages of the agent lifecycle:

Pre-Training (Format Internalization): The base model learns the grammar of GUI actuation events and step-level multimodal alignment.
Supervised Fine-Tuning (SFT): The model is trained with retrieved memories in context, treating actuation artifacts ('CommandEvent' JSONs) as first-class training targets alongside reasoning.
Post-Training (Memory Adherence RL): Utilizes Group Relative Policy Optimization (GRPO) driven by a novel three-component reward function (Step Alignment, Visual Grounding, and Outcome Consistency) and a VLM-generated Process Reward Model (PRM).
Inference-Time Retrieval: A two-stage retrieval stack (Bi-encoder HNSW + Cross-encoder) dynamically pulls relevant memories. The agent tracks execution deviation and autonomously compiles new 'memory.md' files upon task success, endogenously growing its own training corpus.

Furthermore, the paradigm introduces a mechanism for in-training evaluation via self-generated memories, allowing researchers to detect overfitting, underfitting, and context-awareness without relying on static external benchmarks.
This document provides full mathematical formulations, data construction specifications, algorithm details, and hyperparameter guidance for implementing the architecture.
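
As an illustration of the Post-Training stage, the following is a minimal sketch of how three reward components might be combined into a scalar and converted into group-relative advantages in the style of GRPO. It is not the paper's formulation. The class `RewardComponents`, the functions `combined_reward` and `group_relative_advantages`, the equal weights, and the assumption that each component is scored in [0, 1] are all hypothetical choices made for this example. The normalisation uses the population standard deviation; some implementations use the sample standard deviation instead.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardComponents:
    """Hypothetical per-rollout scores, each assumed to lie in [0, 1]."""
    step_alignment: float
    visual_grounding: float
    outcome_consistency: float


def combined_reward(r: RewardComponents,
                    weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three components (the weights are an assumption)."""
    w_step, w_vis, w_out = weights
    return (w_step * r.step_alignment
            + w_vis * r.visual_grounding
            + w_out * r.outcome_consistency)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward by the mean and standard deviation of its group.

    All rewards in `rewards` are assumed to come from rollouts of the same task
    prompt, which is the grouping GRPO uses in place of a learned value baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: three rollouts sampled for the same task, scored and normalised.
group = [
    RewardComponents(1.0, 1.0, 1.0),
    RewardComponents(0.5, 0.5, 0.5),
    RewardComponents(0.0, 0.0, 0.0),
]
advantages = group_relative_advantages([combined_reward(r) for r in group])
```

Rollouts that score above their group's mean receive positive advantages and those below receive negative ones, so the policy update depends only on relative quality within the group.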
