Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
Enterprise agents are often pitched as needing more context and larger models; this paper shows the opposite can be true in a structured ERP workflow. In Microsoft Dynamics 365 expense itemization, keeping only recent tool interactions plus a compact summary beat full-history retention while using roughly one-third of the tokens and less than half the runtime, making context engineering a real cost and reliability lever for finance and operations automation. The evidence is strongest for single-session, tool-heavy workflows with verbose system responses—not a universal deployment rule—but it gives platform and business-systems teams a concrete design issue to press vendors on now.
- For tool-using enterprise agents, full conversation history may be a liability, not just an expense. The paper’s useful insight is that old ERP tool responses can represent stale system state, so disciplined forgetting can improve both accuracy and cost.
- In this benchmark, pruning plus summarization raised complete itemization from 71.0% to 91.6% while cutting total tokens from 1.481M to 553.4K and runtime from 14.56 to 5.79 hours. If similar patterns hold in production workflows, agent economics improve through orchestration design rather than waiting for a cheaper or larger model.
- A serious ERP-agent vendor should be able to explain its memory policy: what gets retained exactly, what gets summarized, when summaries are triggered, and how the window is tuned. “We use a long context model” is not enough if old tool responses can corrupt the agent’s view of the current transaction.
- The paper reports similar directionality across hotel, travel, and meals/gifts expense categories, but the core evidence is still a controlled D365 F&O expense-itemization setup. The next meaningful signal is comparable gains in other structured workflows such as procurement exceptions, order edits, claims handling, or month-end finance operations.
- The result is not a universal agent-memory law: baseline behavior differed sharply by model, and the experiment used a controlled non-interactive harness. The practical takeaway is to test pruning and summarization per workflow and model, not to hard-code “last five plus summary” everywhere.
Evidence ledger
The strongest claims in the brief, along with the confidence and citation depth behind them.
Pruning to the last five tool interactions plus automated summarization produced the best completion and amount-itemized results in the main benchmark.
The best context-engineered configuration materially reduced token use and wall-clock runtime versus full-history retention.
A plausible mechanism is that old ERP tool responses encode superseded form state, which can confuse the agent.
The evidence is credible for the studied workflow but model-dependent and limited in domain breadth.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.LG
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Rui Yang et al.
cs.DC
Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense
Nataraj Agaram Sundar, Tejas Morabia
cs.SE
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems
Yipeng Ouyang et al.
cs.LG
FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse
Lingzhi Yuan et al.