Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and links back to the arXiv abstract.
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
AI video is getting good enough to make a one-minute sketch, but making something people actually want to watch is more a coordination problem than a raw model problem. This paper offers a clever multi-agent production pipeline with surprisingly solid internal evidence, though the “near professional” claim still looks mixed rather than proven.
- The paper’s claim is that generating a watchable comedy sketch is mainly a coordination problem: modern video models can make shots, but they still struggle to assemble a funny, coherent 1–2 minute sequence without a lot of structured oversight. The authors argue that you get better results by organizing AI into production-like roles and letting those roles compete and revise each other, because humor is too subjective for a single fixed reward score to guide reliably.
- Their proposed system, COMIC, uses a population of writer, critic, editor, director, and rendering-critic agents arranged in isolated “islands,” so multiple comedic styles can evolve in parallel instead of converging on one bland average. That mechanism matters because each island applies different tastes and feedback, creating a broader portfolio of scripts and giving the system a better chance of finding something funny and distinctive rather than just safe.
- The implementation is more pragmatic than it first sounds: instead of training a bespoke humor model, the system generates a pool of critics by prompting an LLM, then keeps the ones that best match real audience engagement patterns from 4,940 YouTube examples across five comedy channels. It also uses storyboards and a memory bank before expensive rendering, which is operationally important because video generation is the costly step; the reported base setup runs in about one day on one H200 GPU with only about $5 in API spend, although compute rises quickly as more scripts and rendering critics are added.
- The strongest evidence is that the evaluation machinery seems to work better than simpler alternatives: on held-out tests, task-wise critic selection reached 0.716 accuracy on top-versus-bottom script discrimination, versus 0.670 for a single best critic and 0.654 for averaging critics. The broader quality claims are encouraging but less definitive: human ratings show COMIC beating several agentic baselines on funniness and engagement measures, and automated rankings line up with human rankings, yet the human study is still modest at 110 responses total and the paper’s “approaching professional quality” claim should be read as internal benchmarking rather than settled proof.
- This is one of the more credible arguments that agent structure can improve creative generation without retraining the base models, and it is especially relevant for teams trying to trade more test-time orchestration for better output quality. Still, the gains come with complexity, dependence on proprietary models, and an evaluation proxy based partly on YouTube views, which can be noisy and audience-specific, so the practical read is promising for controlled media workflows but not evidence that AI has solved comedy or robust long-form entertainment generation.
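The island mechanism described above can be sketched in miniature. This is a toy stand-in, not the paper's actual agents: a "script" is just a position on a number line, each island's critic is a hypothetical taste function preferring a different region, and "writers" are random mutations of the island's best draft. The point it illustrates is the design choice in the second bullet: because islands never share critics or drafts, each one converges toward its own style instead of a single average.

```python
import random

random.seed(1)

# Hypothetical stand-ins (not from the paper): each island's critic prefers
# a different region of [0, 1], mimicking distinct comedic tastes.
ISLAND_TASTES = [0.2, 0.5, 0.8]
POP_PER_ISLAND, GENERATIONS = 6, 25

def critic(taste, script):
    """Higher when the draft sits near this island's preferred style."""
    return -abs(script - taste)

def mutate(script):
    """A writer/editor revision step: a small random change to the draft."""
    return min(1.0, max(0.0, script + random.gauss(0, 0.05)))

islands = [[random.random() for _ in range(POP_PER_ISLAND)] for _ in ISLAND_TASTES]
initial_scores = [
    max(critic(t, s) for s in pop) for pop, t in zip(islands, ISLAND_TASTES)
]

for _ in range(GENERATIONS):
    for i, taste in enumerate(ISLAND_TASTES):
        pop = islands[i]
        # writers propose revisions of the island's current best draft
        best = max(pop, key=lambda s: critic(taste, s))
        children = [mutate(best) for _ in range(POP_PER_ISLAND)]
        # the island's critic keeps only the strongest drafts (competition);
        # parents compete too, so the best score never regresses
        islands[i] = sorted(pop + children, key=lambda s: critic(taste, s))[-POP_PER_ISLAND:]

final_scores = [
    max(critic(t, s) for s in pop) for pop, t in zip(islands, ISLAND_TASTES)
]
winners = [max(pop, key=lambda s: critic(t, s)) for pop, t in zip(islands, ISLAND_TASTES)]
print(winners)
```

Because islands are isolated, the three winners reflect three different "tastes"; a single shared critic would instead pull every draft toward one optimum.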
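The task-wise critic selection in the third and fourth bullets can also be sketched under toy assumptions. Nothing here is from the paper's implementation: critics are faked as scorers that are reliable on one comedy style and systematically inverted on the others, and the calibration pairs stand in for the top-versus-bottom YouTube-engagement script pairs. The sketch shows why picking a critic per task can beat both a single globally best critic and the mean of all critics when individual critics have style-specific blind spots.

```python
import random

random.seed(0)

TASKS = ["sketch", "parody", "musical"]
N_CRITICS = 8

def critic_score(critic, task, true_quality):
    """Toy critic (assumption): reliable on one task, inverted on the rest."""
    sign = 1.0 if TASKS.index(task) == critic % len(TASKS) else -1.0
    return sign * true_quality + random.gauss(0, 0.15)

def make_pairs(n):
    """(task, better, worse) quality pairs; a stand-in for
    engagement-labelled top-vs-bottom script pairs."""
    return [
        (random.choice(TASKS), random.uniform(0.6, 1.0), random.uniform(0.0, 0.4))
        for _ in range(n)
    ]

def accuracy(scorer, pairs):
    """Fraction of pairs where the scorer ranks the better script higher."""
    return sum(scorer(t, hi) > scorer(t, lo) for t, hi, lo in pairs) / len(pairs)

calibration, held_out = make_pairs(300), make_pairs(300)

# Task-wise selection: per task, keep the critic that best separates
# top from bottom scripts on the calibration set.
best_per_task = {
    t: max(
        range(N_CRITICS),
        key=lambda c: accuracy(
            lambda task, q: critic_score(c, task, q),
            [p for p in calibration if p[0] == t],
        ),
    )
    for t in TASKS
}

# Baselines: one globally best critic, and the mean over all critics.
single_best = max(
    range(N_CRITICS),
    key=lambda c: accuracy(lambda task, q: critic_score(c, task, q), calibration),
)

strategies = {
    "task-wise": lambda task, q: critic_score(best_per_task[task], task, q),
    "single-best": lambda task, q: critic_score(single_best, task, q),
    "mean": lambda task, q: sum(
        critic_score(c, task, q) for c in range(N_CRITICS)
    ) / N_CRITICS,
}
results = {name: accuracy(s, held_out) for name, s in strategies.items()}
print(results)
```

In this contrived setting the gap is exaggerated, but the direction matches the paper's reported held-out ordering (task-wise 0.716 vs single-best 0.670 vs mean 0.654): averaging mixes in critics whose taste misfits the task, and one global critic cannot cover every style.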
Evidence ledger
COMIC can generate 1–2 minute sketch-comedy videos from character and background references.
Task-wise critic selection outperforms mean and single-best critics on held-out script discrimination.
The base configuration runs in about one day on a single H200 GPU with around $5 in API cost.
Human evaluation is modest in size, so broad quality claims should be treated cautiously.
The system depends on multi-agent search and expensive rendering loops rather than model retraining, so quality improvements trade off against test-time complexity.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.CV
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Xianjing Han et al.
cs.CV
Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
Lu Wang et al.