Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and links back to the arXiv abstract.
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
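The abstract's dual-stream loop (accumulate distilled experiences and skills, retrieve them against the current context, and feed usage history back) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the class and method names, the tag-overlap matching, and the usage counter are all assumptions standing in for XSkill's visually grounded summarization, critique, and retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeItem:
    text: str        # distilled tip (experience) or workflow (skill)
    tags: set        # keywords standing in for the visual/task context
    uses: int = 0    # usage history, fed back into accumulation

@dataclass
class DualStreamStore:
    """Toy stand-in for XSkill's two knowledge streams."""
    experiences: list = field(default_factory=list)  # action-level guidance
    skills: list = field(default_factory=list)       # task-level workflows

    def accumulate(self, stream, text, tags):
        # In the paper this step distills knowledge from multi-path rollouts;
        # here we simply append a pre-summarized item.
        getattr(self, stream).append(KnowledgeItem(text, set(tags)))

    def retrieve(self, stream, context_tags, k=1):
        # Rank stored items by overlap with the current context, record usage.
        items = getattr(self, stream)
        ranked = sorted(items,
                        key=lambda it: len(it.tags & set(context_tags)),
                        reverse=True)
        hits = ranked[:k]
        for it in hits:
            it.uses += 1  # closes the continual-learning loop
        return hits
```

The point of the sketch is structural: all learning lives in an external store that the agent reads and writes at inference time, so the base model's parameters never change.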
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
This paper matters because it points to a practical way to make multimodal agents improve from use without retraining the base model: capture what worked as reusable playbooks and tactical prompts, then retrieve them when similar visual tasks show up again. If that holds up in production, it makes agent quality less dependent on constant model fine-tuning and more dependent on who builds the best memory, retrieval, and tool-orchestration layer. The reported gains hold across multiple benchmarks and backbone models and are large enough to take seriously, but this is still an early systems result, not proof that long-running deployed agents reliably compound improvement over many live cycles.
- The strongest implication is that some agent improvement may come from externalized memory and workflow reuse rather than model updates. That shifts competitive pressure toward the retrieval layer, tool templates, and knowledge curation pipeline—areas product, platform, and operations teams can actually instrument and control.
- A useful buying question is whether a vendor can show a persistent skill and experience layer, not just prompt engineering. This paper’s gains come from storing task-level workflows and action-level tips, then adapting them to new visual contexts; if a vendor cannot explain that loop, their 'learning agent' claims may still just mean stateless tool calling.
- The clearest near-term business value is fewer bad tool calls and more disciplined workflows, not magical new reasoning ability. The ablations show skills materially cut syntax and tool-execution errors, which is exactly the kind of improvement that matters for search, document handling, analytics, and other operational agent workflows.
- This system is not lightweight: it uses multiple rollouts, long summarization and rewrite contexts, embeddings, retrieval, and separate models for execution versus knowledge management. Before treating this as deployment-ready, look for evidence on latency, token cost, storage growth, and whether the gains still hold when the memory loop runs continuously rather than in a one-shot experiment.
- The cross-model result is promising because stronger models can externalize know-how that weaker ones reuse, but the paper also shows transfer can be uneven on open models and may increase exploratory tool use rather than cleanly improve per-try quality. The adoption signal to watch is stable gains across model families without blowing up turns, latency, or error rates.
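The retrieval layer the bullets above point to can be approximated very cheaply. The sketch below uses a toy bag-of-words similarity as the embedding; a deployed system would use a learned embedder and a vector index, and the function names here are illustrative assumptions, not APIs from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real stack would use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity over sparse word-count vectors.
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory, query, k=2):
    """Return the k stored tips most similar to the current task description."""
    q = embed(query)
    return sorted(memory, key=lambda item: cosine(embed(item), q),
                  reverse=True)[:k]
```

Even at this fidelity the operational questions from the bullets are visible: memory grows without bound unless curated, and every query adds embedding and ranking cost on top of the agent's own tool calls.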
Evidence ledger
- XSkill improves multimodal agent performance without parameter updates by learning from prior trajectories.
- The architecture separates execution from knowledge management, implying a stack-level opportunity to pair cheaper runtime models with stronger memory-management models.
- The skill and experience streams play different roles, with skills reducing execution failures and experiences shifting tool-selection behavior.
- The paper does not yet prove durable, production-style continual learning because experiments use a single accumulation-then-test cycle and omit system cost reporting.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
- cs.AI · Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge Crystallization (Linghao Zhang)
- cs.AI · Ares: Adaptive Reasoning Effort Selection for Efficient LLM Agents (Jingbo Yang et al.)
- cs.AI · When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows (Wenxian Yang et al.)