arXiv 2603.11863v1 · Mar 12, 2026

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Zi-Han Wang et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Mar 12, 2026, 12:36 PM

Current score

56

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets, CreativeBench-Combo and CreativeBench-Explore, the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit “convergence-by-scaling,” becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.

Score 56 · PDF-backed · models · inference · data

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

The useful shift here is not that models got “more creative,” but that we may finally have a practical way to measure when they produce genuinely new, working solutions instead of polished nonsense. That matters for any team betting on code copilots, autonomous dev tools, or search-based engineering systems: the paper suggests raw model scaling mostly buys safer recombination, not much more true exploration, which changes how you should evaluate vendors and plan automation roadmaps. The benchmark's evidence is stronger than that of most creativity papers because it uses executable code and human validation, but it is still a code-only research setup, so treat it as an early measurement framework and a directional warning, not proof that machine creativity is production-ready across domains.

  • If this benchmark is directionally right, bigger models mainly get better at correct recombination, not at finding unfamiliar solution paths. That challenges a common planning assumption behind premium-model spend: paying more may improve reliability faster than it improves true innovation.
  • A copilot or agent that only reports pass rates may be optimizing for safe sameness. Vendors claiming autonomous engineering or discovery should be able to show how they distinguish working-but-derivative output from genuinely alternative approaches, and whether that measurement is execution-grounded or just an LLM opinion.
  • The paper’s evidence suggests reasoning mode helps more when the job is to navigate around constraints than when the task is to combine ideas across domains. For product and operations teams, that points to nearer-term value in workflow automation with hard rules, compliance constraints, and exception handling, rather than expecting broad creative leaps from prompting alone.
  • The EvoRePE result is not a breakthrough yet, but it hints that some creativity gains may come from inference-time steering rather than retraining or expensive evolutionary search (a toy version of that tradeoff is sketched after this list). The adoption signal to watch is whether model platforms start exposing lightweight controls for exploration-vs-correctness tradeoffs, with clear guidance on where those controls help and where they damage pass rates.
  • The evidence is more rigorous than typical AI-creativity work, but it is still confined to Python and to a benchmark built by an automated pipeline with about 89% validated instance quality. That means you should use this to sharpen evaluation and procurement criteria now, not to assume the same conclusions automatically hold for design, marketing, science, or other non-executable creative work.
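The brief does not describe EvoRePE's actual mechanics, so the following is only a minimal sketch of the general exploration-vs-correctness tradeoff it gestures at: sample more divergently, then let execution, not an LLM opinion, do the filtering. The `generate`, `run_tests`, and `novelty` callables are hypothetical stand-ins, not the paper's API.

```python
def steer_generation(generate, run_tests, novelty, prompt, k=8, temperature=1.0):
    """Sample k candidate programs, keep only those that execute correctly,
    then return the most novel survivor.

    All three callables are hypothetical stand-ins:
      generate(prompt, temperature) -> source string
      run_tests(source) -> bool   (execution-grounded correctness check)
      novelty(source) -> float    (distance from known reference solutions)
    """
    candidates = [generate(prompt, temperature=temperature) for _ in range(k)]
    passing = [c for c in candidates if run_tests(c)]
    if not passing:
        # Higher temperature bought exploration but no correct program:
        # this is the pass-rate damage the brief warns such controls can cause.
        return None
    return max(passing, key=novelty)
```

The design point is that correctness acts as a hard filter while novelty only ranks the survivors, so turning up exploration can never promote a broken program, only return nothing.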

Evidence ledger

capability · high · pp. 2, 5

CreativeBench introduces an executable-code benchmark that measures creativity as quality multiplied by novelty.
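For intuition, here is a minimal sketch of the unified metric as stated here. The normalization to [0, 1] is an assumption; the paper's exact definitions of quality and novelty may differ.

```python
def creativity_score(quality: float, novelty: float) -> float:
    """Unified metric as stated in the brief: creativity = quality * novelty.

    Assumes both factors are normalized to [0, 1]; the paper's exact
    definitions may differ. The product zeroes out both failure modes:
    novel-but-broken output ('polished nonsense') and correct-but-derivative
    output ('safe sameness') each score near zero.
    """
    assert 0.0 <= quality <= 1.0 and 0.0 <= novelty <= 1.0
    return quality * novelty

# A correct but derivative solution: creativity_score(1.0, 0.05) -> 0.05
# A wildly novel hallucination:      creativity_score(0.0, 0.95) -> 0.0
```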

capability · high · p. 6

Top models still score below 60% Pass@1 on both benchmark subsets, so the tasks remain difficult even for frontier systems.
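The brief does not say how Pass@1 was computed; if it follows the standard unbiased estimator from the code-generation literature (Chen et al., 2021), it looks like this sketch.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled solutions of which c pass,
    estimate the probability that at least one of k draws passes.
    For k = 1 this reduces to the plain pass rate c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 samples with 11 passing: pass_at_k(20, 11, 1) == 0.55, below 60%
```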

strategic · high · p. 8

Scaling improves correctness but tends to suppress divergence, implying that larger models do not automatically become more exploratory.

inference · high · p. 8

Reasoning mode helps exploratory creativity more than combinatorial creativity.

caveat · high · p. 5

Human checks support the benchmark but do not make it definitive; validity was 89.1% and ranking agreement with experts was ρ = 0.78.
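For readers who want to run the same kind of agreement check on their own evaluations: Spearman's ρ compares rank orderings, and scipy computes it directly. All numbers below are invented for illustration; only the ρ = 0.78 figure comes from the brief.

```python
from scipy.stats import spearmanr

# Hypothetical creativity scores for six models, from the benchmark
# and from expert raters (illustrative values only).
benchmark = [0.62, 0.55, 0.48, 0.41, 0.37, 0.30]
experts   = [9.1,  7.8,  8.0,  6.5,  5.9,  6.1]

rho, p = spearmanr(benchmark, experts)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```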

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.AI

Resource-constrained Amazons chess decision framework integrating large language models and graph attention

Tianhao Qian et al.

cs.SE

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Ben Rank et al.

cs.RO

Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges

Rongxiang Zeng, Yongqi Dong

cs.AI

HLER: Human-in-the-Loop Economic Research via Multi-Agent Pipelines for Empirical Discovery

Chen Zhu, Xiaolu Wang

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.