arXiv 2604.28181v1 · Apr 30, 2026

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Tao Ge et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Apr 30, 2026, 5:58 PM

Current score

83

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer -- for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts -- until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.
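The abstract describes a two-phase loop: one agent proposes user-specific, month-scale objectives grounded in a synthetic computer, and a second agent acts as the user, working turn by turn until the objectives are complete. The paper does not publish code, so the sketch below is purely illustrative; every class, function, and field name (`SyntheticComputer`, `propose_objectives`, `run_simulation`) is a hypothetical stand-in for the authors' actual agent stack, with the LLM calls stubbed out.

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticComputer:
    """A synthetic user environment: a persona plus a folder tree of artifacts."""
    persona: str
    files: dict[str, str] = field(default_factory=dict)

def propose_objectives(computer: SyntheticComputer) -> list[str]:
    # In the paper, an LLM agent reads the persona and filesystem to write
    # month-scale objectives; here we stub it with a fixed template.
    return [f"Deliver a quarterly report grounded in {len(computer.files)} "
            f"files for persona '{computer.persona}'"]

def run_simulation(computer, objectives, max_turns=2000):
    """User agent works turn by turn until objectives are met or turns run out."""
    completed, turns = set(), 0
    while len(completed) < len(objectives) and turns < max_turns:
        turns += 1
        # A real agent would navigate the filesystem, message simulated
        # collaborators, and produce artifacts here; this stub marks one
        # objective complete per turn.
        completed.add(objectives[min(len(completed), len(objectives) - 1)])
    return {"turns": turns, "done": len(completed) == len(objectives)}

computer = SyntheticComputer(persona="financial analyst",
                             files={"q3/budget.xlsx": "...", "q3/notes.md": "..."})
result = run_simulation(computer, propose_objectives(computer))
```

The key design point the paper emphasizes is that the environment, not the loop, carries the realism: the agent's work must stay grounded in the computer's existing files and commitments across thousands of turns.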

Score 83 · Full-paper brief · Tags: agents, training, data, infra

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

This paper points to a practical bottleneck in office-work agents: they do not just need better reasoning; they need realistic places to practice, with messy folders, partially finished files, collaborator feedback, and month-long commitments. The authors show that synthetic “computers” can generate training signals that improve agent performance, which could make long-horizon productivity automation less dependent on sensitive enterprise data. The catch is cost and realism: each run is still hours long, fully synthetic, and judged through a model-heavy stack, so this is more a credible roadmap for agent-training infrastructure than a near-term proof of autonomous knowledge work.

  • The important move is generating whole computer worlds—folders, drafts, spreadsheets, collaborators, revisions—so agents can practice the messy middle of knowledge work. If this approach scales economically, scarce real enterprise telemetry becomes less of a bottleneck for training office-work agents.
  • The paper’s strongest business-relevant evidence is that lessons extracted from 900 simulations improved held-out synthetic work: mean score rose from 61.6% to 68.6%, with wins on 83 of 100 test computers. The out-of-domain GDPVal result is encouraging, but buyers should look for the same pattern on their own workflows, not just synthetic environments.
  • The simulations expose exactly the failure modes that matter in regulated or high-stakes workflows: narratives diverging from source workbooks, reviewer corrections not propagating, and tool-use errors accumulating over long runs. A serious agent platform should show reconciliation passes, source-of-truth controls, non-empty-message checks, audit trails, and final QA—not just polished final documents.
  • The authors show a promising learning loop, but the current evidence still depends on synthetic personas, LLM judges, model-specific tooling, and expensive long simulations. The paper itself flags limits: small training sets can hurt, skill libraries can become unwieldy, and today’s synthetic computers are still cleaner and less socially complex than real workplaces.
  • This paper argues, implicitly but strongly, that office automation depends on persistent context: file structure, prior decisions, stakeholder messages, and unfinished commitments. Procurement and product teams evaluating agents should test month-like workflows with changing artifacts and collaborators, not isolated prompt-and-answer tasks.
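The controls named in the third bullet (reconciliation against a source-of-truth workbook, non-empty-message checks, audit findings) can be made concrete as a small post-run QA pass. The sketch below is an assumption-laden illustration, not the paper's tooling; `qa_check` and its field names are invented for this example.

```python
def qa_check(messages: list[str], narrative_figures: dict[str, float],
             workbook_figures: dict[str, float], tolerance: float = 1e-6) -> list[str]:
    """Return a list of audit findings; an empty list means the run passes."""
    findings = []
    # Non-empty-message check: flag blank collaborator messages.
    for i, msg in enumerate(messages):
        if not msg.strip():
            findings.append(f"empty message at index {i}")
    # Reconciliation pass: every figure cited in the narrative must match
    # the source-of-truth workbook within tolerance.
    for key, value in narrative_figures.items():
        truth = workbook_figures.get(key)
        if truth is None:
            findings.append(f"narrative figure '{key}' missing from workbook")
        elif abs(truth - value) > tolerance:
            findings.append(f"figure '{key}' diverged: "
                            f"narrative={value}, workbook={truth}")
    return findings

findings = qa_check(
    messages=["Draft attached.", "  "],
    narrative_figures={"q3_revenue": 1.25e6, "headcount": 42},
    workbook_figures={"q3_revenue": 1.30e6, "headcount": 42},
)
```

A run that passes a check like this produces an audit trail for free: the findings list is exactly the evidence a reviewer or compliance function would want to see alongside the final deliverable.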

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability · high · p.2

The authors instantiated 1,000 synthetic computers and ran one long-horizon simulation per computer.

inference · high · p.14

Each simulation is materially long-running, averaging 2,272 turns and 8.59 hours.

training · high · p.18

Trajectory-derived skills improved held-out synthetic-computer performance by 7.0 percentage points.

caveat · high · p.22

The authors acknowledge realism limits in synthetic collaborators and environments.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.LG

AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

Jiale Liu, Nanzhe Wang

cs.LG

Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Zijian Zhao, Jing Gao, Sen Li

cs.LG

Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Hung Cuong Pham, Fatih Gedikli

cs.IR

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.