Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
We introduce Memento-Skills, a generalist, continually learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (such as web search and terminal operations), the agent continually improves via the Read-Write Reflective Learning mechanism introduced in Memento 2 (Wang et al., 2025). In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants (GAIA) benchmark and Humanity's Last Exam (HLE) demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.
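The read and write phases described in the abstract can be sketched as a minimal loop. This is an illustrative sketch only: the names (`Skill`, `SkillLibrary`, `route`, `reflective_step`) and the token-overlap routing score are assumptions standing in for the paper's trainable skill router and markdown skill format, not the authors' actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A reusable skill stored as a structured markdown file (assumed format)."""
    name: str
    body: str  # instructions, context, and tool usage encoded as markdown


@dataclass
class SkillLibrary:
    """The external, evolving skill memory; the LLM's weights stay frozen."""
    skills: dict = field(default_factory=dict)

    def add_or_update(self, skill: Skill) -> None:
        self.skills[skill.name] = skill


def route(prompt: str, library: SkillLibrary) -> Skill:
    """Read phase: pick the skill most relevant to the current stateful prompt.
    A toy token-overlap score stands in for the behaviour-trainable router."""
    def score(skill: Skill) -> int:
        return len(set(prompt.lower().split()) & set(skill.body.lower().split()))
    return max(library.skills.values(), key=score)


def reflective_step(prompt: str, library: SkillLibrary, outcome: str) -> None:
    """Write phase: fold the new experience back into the selected skill,
    so adaptation happens by mutating external artefacts, not model weights."""
    skill = route(prompt, library)
    skill.body += f"\n<!-- experience: {outcome} -->"
    library.add_or_update(skill)
```

The design point the sketch captures is that both phases operate purely on externalised text artefacts, which is what makes continual learning possible without parameter updates.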
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
This paper pushes a commercially important idea: instead of retraining models every time an agent learns a new workflow, let the agent build and rewrite its own external skill library at deployment time. If that holds up, teams running agent systems could improve task performance by updating reusable instructions, code, and tool logic rather than paying the cost and delay of model fine-tuning. The reported gains are large on two benchmarks, which makes this more than a conceptual curiosity, but the evidence is still benchmark-bound and transfer is uneven: stronger where tasks share structure, weaker where every task is idiosyncratic.
- Revisit the assumption that agent improvement requires model updates. The paper’s core claim is that a frozen LLM can get materially better by rewriting external skill files—prompts, code, and specs—which would shift budget and ownership from model training toward workflow engineering, tool design, and memory governance.
- The real adoption signal is not another benchmark win; it is whether this approach works in domains with repeated task structure. The paper itself shows the difference: transfer is much stronger on HLE’s structured subject taxonomy than on GAIA’s diverse tasks, which means operations, support, research, and compliance workflows may benefit sooner than highly bespoke knowledge work.
- Ask agent vendors how skills are stored, tested, routed, and rolled back. The system's credibility rests on the fact that write-backs are not mere notes: they directly mutate reusable artefacts, and the retrieval stack mixes sparse and dense search with reranking. That makes governance and routing quality product-critical, not implementation trivia.
- Do not overread the routing results yet. The router improvements are real, but they are partly shown on 140 synthetic queries generated from an internal skill corpus, and the authors themselves note that clean synthetic queries are not the same thing as messy production requests.
- If this direction is right, the bottleneck in enterprise agents moves from raw model capability to maintaining a high-quality skill library. The library grew from 5 atomic skills to 41 after GAIA and 235 after HLE, which is encouraging for capability growth but also a warning that skill sprawl, version control, testing, and security review will become real operating issues.
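The retrieval stack mentioned above (sparse plus dense search with reranking) can be illustrated with a minimal fusion sketch. Everything here is an assumption for illustration: the term-overlap scorer stands in for a BM25-style sparse retriever, bag-of-words cosine stands in for embedding search, and the final sort stands in for a cross-encoder reranker; the fusion weight `alpha` is arbitrary.

```python
import math
from collections import Counter


def sparse_score(query: str, doc: str) -> float:
    """Term-overlap count, a stand-in for a BM25-style sparse retriever."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum((q & d).values()))


def dense_score(query: str, doc: str) -> float:
    """Cosine similarity over bag-of-words vectors, a stand-in for embeddings."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0


def retrieve(query: str, docs: list, k: int = 3, alpha: float = 0.5) -> list:
    """Fuse sparse and dense scores, keep the top-k, then rerank.
    In production the rerank stage would be a cross-encoder; here it is
    the dense score alone, purely to show the two-stage shape."""
    fused = sorted(
        docs,
        key=lambda d: alpha * sparse_score(query, d)
        + (1 - alpha) * dense_score(query, d),
        reverse=True,
    )
    candidates = fused[:k]
    return sorted(candidates, key=lambda d: dense_score(query, d), reverse=True)
```

The two-stage shape is the governance-relevant part: the fused first stage controls recall over the skill library, while the rerank stage controls which skill actually fires, so both need testing and rollback paths as the library grows.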
Evidence ledger
- Continual learning is achieved without updating LLM parameters by evolving external skill memory.
- The system delivered benchmark gains of +13.7 points on GAIA and +20.8 points on HLE versus a static Read-Write baseline.
- Transfer is uneven: stronger when tasks share structure, weaker on diverse tasks with little reasoning overlap.
- The retrieval stack and writable skill artefacts are important parts of the performance story, not side details.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.AI
When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows
Wenxian Yang et al.
cs.AI
Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge Crystallization
Linghao Zhang
cs.LG
The PokeAgent Challenge: Competitive and Long-Context Learning at Scale
Seth Karten et al.
cs.AI
Resource-constrained Amazons chess decision framework integrating large language models and graph attention
Tianhao Qian et al.