arXiv 2604.14655v1 · Apr 16, 2026

AgentGA: Evolving Code Solutions in Agent-Seed Space

David Y. Y. Tan, Kellie Chin, Jingxian Zhang

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Apr 16, 2026, 6:03 AM

Current score

86

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run from a reset workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. On the 10 benchmark runs reported here, AgentGA averages 74.52% on the Exceeds % of Human metric, versus 54.15% for AIDE. Across 1135 parent-child comparisons, descendants given parent archives outperform runs started from scratch, indicating that inherited artifacts improve later autonomous runs. These findings support agent-seed optimization as a practical design point for autonomous code-search systems.
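To make the mechanics concrete, here is a minimal Python sketch of the outer loop the abstract describes: a population of agent seeds (task prompt plus optional parent archives), a fresh autonomous run per child, a 1:1 tournament against the current elite, and a Hedge-style weight update for the operator that produced the child. Every name here (AgentSeed, run_agent, OPERATORS, the ETA value) is an illustrative assumption, and the paper's modified Hedge controller and exact tournament pairing are only approximated, not reproduced.

"""Sketch of an agent-seed genetic algorithm, under the assumptions above."""
import math
import random
from dataclasses import dataclass, field

@dataclass
class AgentSeed:
    task_prompt: str
    parent_archives: list = field(default_factory=list)  # inherited artifacts

@dataclass
class RunResult:
    seed: AgentSeed
    archive: dict      # artifacts the run produced (code, notes, metrics)
    score: float       # benchmark score of the run's best submission

def run_agent(seed: AgentSeed) -> RunResult:
    """Launch one autonomous run from a reset workspace (stubbed here)."""
    raise NotImplementedError("stand-in for a full long-horizon agent run")

# Hedge-style controller: each operator keeps a weight that is multiplied up
# whenever a child it produced wins its tournament.
OPERATORS = ["initial", "mutate_seed", "crossover_archives"]
weights = {op: 1.0 for op in OPERATORS}
ETA = 0.5  # learning rate, illustrative value only

def pick_operator() -> str:
    total = sum(weights.values())
    return random.choices(OPERATORS, [weights[o] / total for o in OPERATORS])[0]

def hedge_update(op: str, reward: float) -> None:
    weights[op] *= math.exp(ETA * reward)

def evolve(task_prompt: str, population: list, generations: int):
    for _ in range(generations):
        op = pick_operator()
        if op == "initial" or not population:
            child_seed = AgentSeed(task_prompt)            # de novo start
        else:
            parents = random.sample(population, k=min(2, len(population)))
            child_seed = AgentSeed(task_prompt, [p.archive for p in parents])
        child = run_agent(child_seed)
        # Deterministic 1:1 tournament against the current elite.
        elite = max(population, key=lambda r: r.score, default=None)
        won = elite is None or child.score > elite.score
        hedge_update(op, 1.0 if won else 0.0)
        population.append(child)
    return max(population, key=lambda r: r.score)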

Score 86 · Full-paper brief · agents · inference · infra

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

This paper suggests a practical shift in how autonomous coding systems should be improved: instead of endlessly tweaking generated code or letting agents accumulate messy state, optimize the reusable starting package the agent begins from. In the reported Kaggle-style tabular ML benchmark, that approach beat a strong agent baseline by a wide margin, which matters because it points to a more controllable way to compound progress across runs rather than paying for isolated one-off agent attempts. If this result holds outside tabular AutoML, product, operations, and AI platform teams should expect pressure to build agent systems around reusable workspaces, archives, and replayable workflows—not just better prompts—though the evidence is still early, narrow, and compute-hungry.

  • The paper’s strongest claim is that better inherited starting context beats fresh starts: parent-conditioned operators were consistently more competitive, while de novo "Initial" proposals won only 9 of 74 tournaments. For buyers and builders, that means the durable advantage may come less from a clever prompt and more from how well a system packages reusable artifacts, notes, and prior experiments for the next run.
  • A useful diligence question is whether an agent platform can launch each run from a clean workspace, selectively import prior artifacts, and deterministically replay archived solutions (a minimal sketch of that pattern follows this list). That architecture is central here, and it matters for governance, debugging, and knowing whether performance is actually compounding versus just drifting through a long conversation.
  • The results are meaningful because they were judged on Kaggle private leaderboards, not just internal validation, but the setup still depends on a bounded workflow with clear scoring and executable submissions. The adoption signal to watch is whether the same seed-optimization pattern works in other domains with expensive but automatable evaluation loops, such as analytics pipelines, simulation-driven engineering, or internal coding tasks with test harnesses.
  • This is not a lightweight wrapper around an LLM. Even one illustrative agent run took 15 minutes, 63 LLM calls, and heavy token usage, so if this approach becomes real in production, the operational bottleneck will include runtime controls, caching, evaluation infrastructure, and queueing—not just model quality.
  • The benchmark improvement is large in the reported 10 runs, and all 10 beat the AIDE reference, but this is still a preprint marked as work in progress, with only part of a 16-task suite completed and no proof yet that the gains transfer across models, domains, or lower-cost settings. Treat it as a credible design direction, not a settled new standard.
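The diligence question above maps to a simple operational pattern. The sketch below shows one way it could look, assuming a file-based archive whose manifest.json records the command needed to re-run it; the directory layout, manifest fields, and function names are assumptions for illustration, not the paper's actual interfaces.

"""Clean-workspace seeding and deterministic replay, under the assumptions above."""
import json
import shutil
import subprocess
from pathlib import Path

def seed_workspace(run_dir: Path, task_prompt: str, parent_archives: list) -> None:
    """Create a fresh workspace and selectively import inherited artifacts."""
    if run_dir.exists():
        shutil.rmtree(run_dir)              # guarantee a clean start
    run_dir.mkdir(parents=True)
    (run_dir / "TASK.md").write_text(task_prompt)
    for i, archive in enumerate(parent_archives):
        # Copy parent material into its own subdirectory so the agent can
        # inspect or reuse it without mutating shared state.
        shutil.copytree(archive, run_dir / f"parent_{i}")

def replay(archive_dir: Path) -> subprocess.CompletedProcess:
    """Deterministically re-run an archived solution from its manifest."""
    manifest = json.loads((archive_dir / "manifest.json").read_text())
    return subprocess.run(
        manifest["command"],                # e.g. ["python", "train.py"]
        cwd=archive_dir,
        check=True,
    )

The point of the pattern is auditability: because every run starts clean and every inherited artifact is imported explicitly, a team can tell whether performance gains come from the seeded material or from uncontrolled accumulated state.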

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability · high · p.6

AgentGA improves reported benchmark performance versus AIDE on private Kaggle leaderboards.

capability · high · p.6, p.7

Inherited parent archives improve descendant runs relative to fresh starts in many parent-child comparisons.

stack · high · p.2, p.3

The system relies on fresh workspaces with curated artifact inheritance rather than persistent mutable conversations.

inference · medium · p.21, p.23

The approach is operationally heavy, requiring many agent steps, LLM calls, and token consumption even in a single example run.

caveat · high · p.1, p.19

Evidence is preliminary and not yet broad enough to establish general-purpose autonomous coding gains.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.LG

Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Zijian Zhao, Jing Gao, Sen Li

cs.LG

AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

Jiale Liu, Nanzhe Wang

cs.AI

Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Khushal Sethi

cs.CR

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang et al.

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.