Abstracted

A weekly digest of the most commercially relevant arXiv papers for operators, PMs, investors, and non-research engineers.

  • Reward Modeling for Multi-Agent Orchestration

    King Yeung Tsang et al./arXiv abstract

    Why this is worth your attention

    Multi-agent AI is starting to look less like a contest over who has the biggest agent and more like a control-plane problem: which agent should run, in what order, and when to stop wasting tokens. This paper claims an orchestration-level reward model can score plans before expensive sub-agents execute, cutting training and verification cost while improving benchmark accuracy. If the pattern survives outside these lab benchmarks, agent platforms will compete on routing, selection, and execution discipline as much as on model choice; the evidence is credible enough to test, but not yet strong enough to underwrite production ROI.

  • FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

    Lingzhi Yuan et al./arXiv abstract

    Why this is worth your attention

    Agent workflows are starting to look less like one-off prompt chains and more like an operations problem: build a small library of tested procedures, then route each request to the cheapest one likely to work. FlowBank reports that this “precompute and reuse” approach beats both handcrafted workflows and strong automated baselines across five benchmarks while keeping inference cost competitive. If the pattern survives real workloads, teams building agent systems may get much of the benefit of per-query adaptation without paying to synthesize a new workflow every time; the unresolved question is whether the routing model stays reliable outside clean benchmark distributions.

  • FASE: Fast Adaptive Semantic Entropy for Code Quality

    Shizhe Lin, Ladan Tahvildari/arXiv abstract

    Why this is worth your attention

    AI coding systems do not just need better code generators; they need cheap ways to know when generated code is probably wrong before errors cascade through agents. This paper claims FASE can estimate code-generation uncertainty far faster than LLM-based semantic-entropy checks, with better benchmark correlation to Pass@1 and roughly 0.3% of the runtime cost. If that holds in real software environments, reliability scoring becomes realistic as an always-on routing layer for AI coding workflows rather than an expensive offline audit. The evidence is promising but still benchmark-bound, so treat this as a near-term systems design signal, not a finished enterprise assurance layer.

  • Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

    Abhilasha Lodha et al./arXiv abstract

    Why this is worth your attention

    Enterprise agents are often pitched as needing more context and larger models; this paper shows the opposite can be true in a structured ERP workflow. In Microsoft Dynamics 365 expense itemization, keeping only recent tool interactions plus a compact summary beat full-history retention while using roughly one-third of the tokens and less than half the runtime, making context engineering a real cost and reliability lever for finance and operations automation. The evidence is strongest for single-session, tool-heavy workflows with verbose system responses—not a universal deployment rule—but it gives platform and business-systems teams a concrete design issue to press vendors on now.

  • What Should a Skill Remember? Quality–Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

    Qinghua Xing et al./arXiv abstract

    Why this is worth your attention

    If agent teams are building libraries of reusable “skills” or runbooks, this paper points to a practical cost lever: rewrite those skills to preserve the operational details that prevent wasted exploration, not merely to make them shorter. In a controlled benchmark, a learned rewrite policy cut total execution cost by 7.0% and downstream agent-token cost by 6.0% while preserving verifier quality, and the same frozen rewrites showed larger average savings across other model stacks. The business implication is that agent cost control may become a knowledge-engineering problem—what instructions, APIs, checks, and formulas must stay visible—rather than a simple prompt-compression exercise, though the evidence is still benchmark-bound rather than production-proven.

  • The Illusion of Multi-Agent Advantage

    Prathyusha Jwalapuram et al./arXiv abstract

    Why this is worth your attention

    Multi-agent AI is often sold as a way to get better reasoning through collaboration, but this paper says the current automated version may mostly be a cost multiplier: more calls, more roles, more routing, and often worse results than a strong single-agent baseline. The business implication is immediate for teams piloting agent platforms: before paying for orchestration complexity, test whether a better base model plus simple self-consistency is cheaper and more reliable. The paper does not kill multi-agent systems; it shifts the burden of proof toward verifiable task decomposition, role contribution, and cost-controlled benchmarks.

  • MiniMax Sparse Attention

    Xunhao Lai et al./arXiv abstract

    Why this is worth your attention

    MiniMax is making a practical attack on one of the biggest blockers to useful long-context AI: the cost of letting a model look across hundreds of thousands or millions of tokens. The paper claims its sparse attention design cuts attention compute by 28.4× at 1M context and turns that into 14.2× prefill and 7.6× decode speedups on H800, while keeping quality broadly close to dense attention in a 109B-scale model. If that generalizes, long-context agents, codebase reasoning, and large-document workflows become materially cheaper to serve; the open question is how much of the gain survives outside MiniMax’s model, kernels, hardware, and benchmark mix.

  • Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

    Yavar Yeganeh et al./arXiv abstract

    Why this is worth your attention

    Semiconductor fabs already run on rules and dispatch heuristics; this paper suggests a more dynamic control layer can learn from event histories and coordinate lot/tool choices across the fab, not just tune a bottleneck station. The reported simulator gains (throughput improvements of roughly the high teens to low 20s in percent versus FIFO in several settings) are big enough to matter operationally if they survive live validation. The practical shift is toward digital-twin-trained dispatch agents that can be pre-trained offline and cautiously fine-tuned online, but the evidence is still simulator-based, proprietary, and only partially stress-tested across real production regimes.

  • AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

    Bojie Rong et al./arXiv abstract

    Why this is worth your attention

    Cloud documentation QA is the kind of dull, high-volume work enterprises rarely automate well because agents must touch live systems, satisfy audit rules, and avoid leaking operational data to frontier APIs. This paper claims Alibaba trained a private 32B web agent that gets within 1.82 percentage points of the best proprietary model on 278 real cloud-console tasks while cutting reported inference cost by 92%, and its production pilot already found 4,399 confirmed documentation defects. If the result travels, the opportunity is not just better docs: it points to cheaper private agents for repetitive console operations, compliance checks, and UI-driven back-office workflows. The catch is that much of the win comes from heavy environment engineering, not a plug-and-play model upgrade.

  • A History-Aware Visually Grounded Critic for Computer Use Agents

    Jaewoo Lee et al./arXiv abstract

    Why this is worth your attention

    Computer-use agents fail in ways that are expensive and mundane: they click the wrong button, forget what they already tried, or trust a plan that no longer matches the screen. This paper shows that a separate history-aware, visually grounded critic can catch some of those mistakes before execution, lifting benchmark success across web, mobile, and desktop tasks without requiring DOM or accessibility-tree access. If the result transfers to enterprise software, the near-term opportunity is not fully autonomous office work; it is making GUI automation less brittle by adding a review layer around every action.

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.