arXiv 2605.28775v1 | May 27, 2026

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim, Kangsan Kim, Sung Ju Hwang

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

May 27, 2026, 5:37 PM

Current score

73

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.
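One way to picture an error-aware objective is a loss that puts extra weight on the part of the supervision matching the student's error type. The abstract does not give the paper's formula, so the sketch below is an illustrative assumption, not LearnWeak's actual objective. The function name `error_aware_nll`, the "plan"/"action" segment labels, and the `focus_weight` value are all hypothetical.

```python
from typing import List

# Hypothetical mapping from a diagnosed error type to the token segment to emphasize.
FOCUS_SEGMENT = {"planning": "plan", "execution": "action"}


def error_aware_nll(
    token_logprobs: List[float],
    segments: List[str],
    error_type: str,
    focus_weight: float = 2.0,  # assumed value, chosen only for illustration
) -> float:
    """Weighted mean negative log-likelihood over target tokens.

    segments[i] is "plan" or "action" for token i. Tokens in the segment
    that matches the student's diagnosed error get weight `focus_weight`;
    all other tokens get weight 1.0.
    """
    if not token_logprobs or len(token_logprobs) != len(segments):
        raise ValueError("token_logprobs and segments must be non-empty and equal length")
    focus = FOCUS_SEGMENT[error_type]
    weights = [focus_weight if seg == focus else 1.0 for seg in segments]
    weighted_loss = sum(w * -lp for w, lp in zip(weights, token_logprobs))
    return weighted_loss / sum(weights)


# Toy example: two planning tokens, one action token the model finds unlikely.
logprobs = [-1.0, -1.0, -3.0]
segs = ["plan", "plan", "action"]
loss_execution = error_aware_nll(logprobs, segs, "execution")
loss_planning = error_aware_nll(logprobs, segs, "planning")
```

In this toy case the same trajectory yields a larger loss when the diagnosed error is an execution error, because the poorly predicted action token is up-weighted. A uniform objective would treat all three tokens equally.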

Score 73 | Full-paper brief | agents | training | data | models

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

Small computer-use agents usually fail in uneven, domain-specific ways; this paper shows a practical route to turning those failures into targeted training rather than throwing generic synthetic data at the problem. If the result holds outside OSWorld, software automation teams could deploy cheaper specialist agents for narrow workflows instead of renting a large expert model for every application. The evidence is meaningful: LearnWeak reports average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight OSWorld domains. It still depends on a stronger teacher, controlled environments, and reliable automatic verification.

  • If these results transfer beyond OSWorld, companies may not need a large general-purpose GUI agent for every software workflow. A shared small model plus per-domain adapters could make computer-use automation cheaper to serve, easier to localize, and more acceptable where proprietary API calls are constrained.
  • The paper’s main practical claim is that targeted failure data beats broad synthetic data: the system trains on cases where the teacher succeeds and the student fails. For buyers and builders, the question shifts from “how much agent training data do you have?” to “can you identify the current model’s recurring failure modes and generate practice around them?”
  • A credible computer-use agent stack should be able to show more than demo success rates: it should explain whether failures are planning mistakes, execution mistakes, or environment-state mistakes, and how those failures become training signals. Be skeptical of systems that rely on the agent saying it is done rather than independent trajectory and screenshot verification.
  • This is most ready for bounded software domains where tasks can be replayed and automatically judged, not open-ended desktop work. The reported setup is operationally lightweight for research (under five hours of LoRA fine-tuning on one H200 for 7–8B models, with small seed setup), but it still depends on having a controlled environment and verifiable task outcomes.
  • The evidence is strong enough to take the specialization pattern seriously, but it is still benchmark-centered and model-specific. Results depend on a stronger teacher, a target student whose weaknesses can be measured, and reproducible environments; even one common domain, Chrome, was excluded for reproducibility issues.
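The selection rule described in the points above (keep cases where the teacher succeeds and the student fails, then look at what kind of failure it was) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `Rollout` record, its field names, and the failure labels are hypothetical, and success is assumed to come from an independent verifier rather than the agent's own report.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rollout:
    """One task attempted by both agents (hypothetical record layout)."""
    task_id: str
    teacher_success: bool  # assumed to be judged by an independent verifier
    student_success: bool
    # "planning", "execution", or "environment"; None when the student succeeded
    student_failure_type: Optional[str] = None


def select_weakness_tasks(rollouts: List[Rollout]) -> List[Rollout]:
    """Keep only tasks the teacher solves and the student does not."""
    return [r for r in rollouts if r.teacher_success and not r.student_success]


def failure_profile(rollouts: List[Rollout]) -> Counter:
    """Count failure types among the selected weakness tasks."""
    selected = select_weakness_tasks(rollouts)
    return Counter(r.student_failure_type or "unlabeled" for r in selected)


# Made-up toy data, for illustration only.
rollouts = [
    Rollout("t1", True, False, "planning"),
    Rollout("t2", True, True),
    Rollout("t3", False, False, "execution"),  # teacher also fails: no usable supervision
    Rollout("t4", True, False, "execution"),
    Rollout("t5", True, False, "planning"),
]
selected_ids = [r.task_id for r in select_weakness_tasks(rollouts)]
profile = failure_profile(rollouts)
```

Tasks where both agents fail are dropped because the teacher offers no successful trajectory to learn from. Tasks the student already solves are dropped because they add little new signal.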

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability | high | p.7

LearnWeak improves two small computer-use agent backbones, EvoCUA-8B and OpenCUA-7B, by 11.6 and 11.1 percentage points on average, respectively, across eight OSWorld domains.

stack | high | p.5

The method supports modular per-domain specialization using LoRA adapters on top of a shared frozen base agent.
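A shared frozen base with per-domain adapters implies a routing step at serving time: choose the adapter for the target domain, or fall back to the base model. The sketch below shows only that bookkeeping. The `AdapterRegistry` class, the model identifier, the domain names, and the adapter paths are hypothetical, and no real adapter-loading library is called.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AdapterRegistry:
    """Maps a software domain to the adapter trained for it (hypothetical design)."""
    base_model: str  # identifier of the shared frozen base agent
    adapters: Dict[str, str] = field(default_factory=dict)  # domain -> adapter path

    def register(self, domain: str, adapter_path: str) -> None:
        # Normalize case so lookups do not depend on capitalization.
        self.adapters[domain.lower()] = adapter_path

    def resolve(self, domain: str) -> Optional[str]:
        """Return the adapter path for a domain, or None to use the base model as-is."""
        return self.adapters.get(domain.lower())


registry = AdapterRegistry(base_model="small-cua-base")  # placeholder name
registry.register("Spreadsheet", "adapters/spreadsheet")  # placeholder paths
registry.register("image_editor", "adapters/image_editor")

spreadsheet_adapter = registry.resolve("spreadsheet")
unknown_adapter = registry.resolve("email_client")  # no adapter trained yet
```

Because the base weights stay frozen, adding a new domain only means training and registering one more adapter, and the adapters for existing domains are left untouched.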

training | medium | p.1, p.16

The paper argues that targeted weakness-focused data is more useful than broad synthetic data generation for small agent specialization.

caveat | medium | p.15

The reported specialization loop is operationally plausible but depends on a stronger teacher model and verifier/generation components.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.LG

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Rui Yang et al.

cs.AI

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Taicheng Guo et al.

cs.LG

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

Lingzhi Yuan et al.

cs.LG

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

Taras Sereda et al.

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.