Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior. We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics. FormulaCode comprises 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling holistic assessment of LLM agents' ability to optimize codebases under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents. Project website at: https://formula-code.github.io
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
This paper is less about whether AI can write code and more about whether coding agents can do the kind of repository-wide performance work that would actually reduce engineering cost on mature software. Based on a more realistic benchmark than most of the field uses, the answer is: partly yes, but not reliably enough to trust unattended. Agents do deliver real speedups, yet they still trail human experts, especially when the fix requires cross-file reasoning and careful trade-offs across many workloads. If that holds in practice, engineering, platform, and procurement teams should stop treating agentic code optimization as a near-term autopilot capability and start treating it as a selective co-pilot workflow in which model choice, agent design, and validation discipline matter more than demo quality.
- If you were assuming coding agents are close to autonomously tuning large codebases, this paper pushes back hard. All evaluated configurations achieved speedups over the original code, but every agent still underperformed human experts on the paper’s human-relative advantage metric, which is a better proxy for production usefulness than a single win on one benchmark task.
- The paper shows that performance depends heavily on agent framework and task shape, not just model branding: some setups do better on module-level refactors, others on function-level edits, and longer reasoning chains can erase the apparent cost advantage of cheaper models. A practical buying question is whether a vendor is strong at local hot-path tuning, at broader repo navigation, or at both, and what validation loop they run before claiming savings.
- The encouraging part is that agents are already useful on localized optimization work. They are stronger on function-level changes, can sometimes beat the human patch with extra micro-optimizations, and do well on tactics like batching or parallelization; they are much weaker when the best fix requires vectorization, lower-level system changes, or negotiating trade-offs across many workloads.
- A notable operational warning: agents spent far more tool calls benchmarking than testing, which helps explain why optimization gains can come with correctness risk. Adoption signals that matter more than leaderboard placement are whether vendors or internal teams can show strong regression testing, multi-workload evaluation, and cost-aware decision rules before merging agent-generated changes.
- The benchmark itself is a substantive contribution: 957 real tasks from 70 repositories with roughly 265 workloads per task is much closer to real engineering constraints than synthetic pass/fail coding tests. But the headline results were run on FORMULA_CODE-V due to compute limits, so the paper is best read as a credible capability map and workflow warning, not definitive proof of how these agents will perform across enterprise codebases.
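To make the multi-workload framing above concrete, here is a minimal sketch of how a per-task speedup and a human-relative score could be computed. This is an illustrative calculation under assumed timings, not the paper's actual metric definition; the function and sample numbers are hypothetical.

```python
import math

def geomean_speedup(baseline_times, patched_times):
    """Geometric mean of per-workload speedups (baseline / patched).
    The geometric mean keeps one extreme workload from dominating."""
    ratios = [b / p for b, p in zip(baseline_times, patched_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings (seconds) for three workloads on one task.
baseline = [2.0, 1.0, 4.0]
agent    = [1.0, 1.0, 2.0]   # agent-generated patch
expert   = [0.5, 0.8, 1.0]   # expert-authored patch

agent_speedup  = geomean_speedup(baseline, agent)    # ~1.59x
expert_speedup = geomean_speedup(baseline, expert)   # ~2.71x

# Illustrative "human-relative" score: below 1.0 means the agent
# delivered a real speedup but still trails the human expert.
relative = agent_speedup / expert_speedup
```

A score like this captures the paper's headline finding in miniature: an agent can beat the baseline on every workload and still land well under parity with the expert patch.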
Evidence ledger
Agents improve runtime over baseline code but still lag human experts on repository-scale optimization.
Agents are more effective on local/function-level optimization than broader repository-level changes.
Agent performance and economics depend materially on framework design and reasoning trajectory length, not only the underlying model.
FormulaCode is a relatively realistic benchmark built from real repositories and many workloads per task, making it more informative than simpler code benchmarks.
The paper should not be read as proof that autonomous code optimization is market-ready because evaluations were run on FORMULA_CODE-V due to compute constraints.
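The validation discipline flagged in the ledger (agents benchmarked far more than they tested) can be sketched as a simple merge gate that checks correctness before rewarding speed. This is an assumed policy for illustration, not the benchmark's acceptance rule; the function name and threshold are hypothetical.

```python
def accept_patch(tests_pass: bool, speedup: float, min_speedup: float = 1.05) -> bool:
    """Merge gate for an agent-generated optimization patch.

    Correctness comes first: a failing regression suite rejects the
    patch regardless of speedup. Only then does a meaningful,
    multi-workload speedup (here, >= 5%) justify the merge.
    """
    return tests_pass and speedup >= min_speedup

# A fast-but-wrong patch must be rejected.
fast_but_wrong = accept_patch(tests_pass=False, speedup=2.0)
```

Real pipelines would add noise-aware thresholds and per-workload regression checks, but even this ordering (tests gate benchmarks, not the reverse) addresses the tool-call imbalance the paper observed.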
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.SE
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Jialong Chen et al.
cs.AI
Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge Crystallization
Linghao Zhang
cs.LG
Automatic Generation of High-Performance RL Environments
Seth Karten, Rahul Dev Appapogu, Chi Jin