Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
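To make the recipe concrete, below is a minimal numpy sketch of the two compression paths the abstract describes: q8_0-style int8 keys with per-block scales, and values rotated by an orthonormal FWHT, standardized, and snapped to 8 Lloyd-Max centroids. The block size, the per-row standardization, and the centroid table are illustrative assumptions; the paper's exact codec may differ.

```python
import numpy as np

# Classic 8-level Lloyd-Max output levels for a unit Gaussian (Max, 1960);
# the paper's tuned centroids may differ slightly.
CENTROIDS = np.array([-2.1519, -1.3439, -0.7560, -0.2451,
                       0.2451,  0.7560,  1.3439,  2.1519], dtype=np.float32)

def fwht(x):
    """Orthonormal Fast Walsh-Hadamard Transform over the last axis (power-of-two
    length). With the 1/sqrt(n) scaling the transform is self-inverse."""
    x = x.astype(np.float32).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_keys_q8_0(k, block=32):
    """Keys: int8 with one fp32 scale per 32-value block, as in llama.cpp's q8_0."""
    blocks = k.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    safe = np.where(scales == 0, 1.0, scales)
    return np.clip(np.round(blocks / safe), -127, 127).astype(np.int8), scales

def quantize_values_3bit(v):
    """Values: FWHT rotation, per-row standardization toward N(0,1), then
    nearest-centroid 3-bit codes plus one fp32 scale per row."""
    rot = fwht(v)
    scale = rot.std(axis=-1, keepdims=True) + 1e-8
    z = rot / scale
    codes = np.abs(z[..., None] - CENTROIDS).argmin(axis=-1).astype(np.uint8)
    return codes, scale

def dequantize_values_3bit(codes, scale):
    return fwht(CENTROIDS[codes] * scale)  # a second FWHT undoes the rotation

# Round-trip demo on a fake (tokens x head_dim) value slab.
v = np.random.randn(4, 128).astype(np.float32)
codes, scale = quantize_values_3bit(v)
v_hat = dequantize_values_3bit(codes, scale)
print("relative value error:", float(np.linalg.norm(v - v_hat) / np.linalg.norm(v)))
```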
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
If correct, PolyKV attacks a practical bottleneck in agentic AI: today, every agent rereading the same long context typically carries its own expensive KV cache. The paper’s core move is to turn that duplicated GPU memory into a single compressed shared resource; the reported Llama-3-8B case cuts 15-agent KV cache memory from 19.8 GB to 0.45 GB at a small proxy-quality cost. This is an inference-serving idea, not a new model capability, and it looks promising rather than production-proven, because latency, throughput, and task-level outcomes are still missing.
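The headline figures compose from two independent savings: sharing removes the per-agent duplication, and compression shrinks the one cache that remains. A quick sanity check in Python, using only numbers quoted in this brief (the per-agent cache size is inferred from them, not measured):

```python
# Sanity check of the reported figures: the baseline pays for 15 separate
# caches, while PolyKV pays for one compressed cache regardless of agent count.
agents = 15
baseline_gb = 19.8                    # 15 uncompressed per-agent caches (reported)
per_agent_gb = baseline_gb / agents   # ~1.32 GB for one 4K-token Llama-3-8B cache
pool_gb = per_agent_gb / 2.91         # one shared cache at the 2.91x ratio
print(f"pool: {pool_gb:.2f} GB, reduction: {1 - pool_gb / baseline_gb:.1%}")
# -> pool: 0.45 GB, reduction: 97.7%

# The 2.91x ratio itself matches the bit widths: fp16 keys+values cost 16+16
# bits per element, int8 keys plus 3-bit values cost 8+3, and 32/11 ≈ 2.91
# (scale and centroid metadata ignored).
```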
- If several agents need to work from the same long document, PolyKV suggests you should not pay for the same cache N times. That matters for research, diligence, legal review, support, and analytics workflows where many specialized agents fan out over one source packet.
- The useful shift is not just the 2.91× compression; it is replacing per-agent cache duplication with a shared pool (a sketch of the injection pattern follows this list). In the paper’s largest Llama-3-8B example, that combination cuts KV cache memory from 19.798 GB to 0.454 GB for 15 agents, which changes how many concurrent agents fit on the same GPU budget.
- A buyer should ask whether an inference or agent platform can share and compress prefilled context across agents, then demand time-to-first-token (TTFT), throughput, and end-to-end latency numbers. This paper shows a strong memory story, but it explicitly does not report the runtime metrics that decide production economics.
- The adoption signal to watch is support in production inference stacks, tested on larger models and messier enterprise workloads, not another isolated compression table. The reported stability across two small-to-mid model sizes is encouraging, but it is not yet evidence for 70B-class deployments or high-throughput agent systems.
- Perplexity and BERTScore say the compressed cache did not obviously distort model behavior in these tests, but they are still proxy measures. The claim that compression noise may even help on coherent documents is explicitly a hypothesis, not a result to build product promises around.
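For the mechanically curious, here is a minimal sketch of the write-once, read-many pattern using HuggingFace's DynamicCache, the injection mechanism the paper names. The compress_cache and decompress_cache helpers are hypothetical stand-ins for PolyKV's codec, and the file and prompts are invented; the prefill-then-reuse flow follows the documented transformers prompt-cache pattern.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")
tok = AutoTokenizer.from_pretrained(model_id)

shared_context = open("source_packet.txt").read()  # placeholder: the one long document
ctx_inputs = tok(shared_context, return_tensors="pt").to(model.device)

# 1) Write once: prefill the shared context to build its KV cache, then compress.
with torch.no_grad():
    prefill = model(**ctx_inputs, past_key_values=DynamicCache())
pool = compress_cache(prefill.past_key_values)  # hypothetical: int8 K / 3-bit V codec

# 2) Read many: each agent rehydrates its own DynamicCache from the shared pool,
#    so decoding one agent's continuation never mutates another agent's view.
def agent_cache(pool):
    cache = DynamicCache()
    for layer_idx, (k, v) in enumerate(decompress_cache(pool)):  # hypothetical codec
        cache.update(k, v, layer_idx)
    return cache

agent_prompts = ["Summarize the risks.", "List all named parties.",
                 "Draft follow-up questions."]  # invented examples
for prompt in agent_prompts:
    full = tok(shared_context + "\n\n" + prompt, return_tensors="pt").to(model.device)
    out = model.generate(**full, past_key_values=agent_cache(pool), max_new_tokens=128)
    print(tok.decode(out[0, full.input_ids.shape[1]:], skip_special_tokens=True))
```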
Evidence ledger
The strongest claims in the brief, along with the confidence and citation depth behind them.
PolyKV uses a write-once, read-many shared KV cache pool rather than allocating a separate full cache per agent.
The paper reports a stable 2.91× compression ratio from int8 keys and 3-bit values.
The largest reported example reduces KV cache memory by 97.7% for 15 Llama-3-8B agents sharing a long context.
The paper does not establish production serving gains because TTFT, throughput, and end-to-end latency are deferred.
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.MA
Dynamic Attentional Context Scoping: Agent-Triggered Focus Sessions for Isolated Per-Agent Steering in Multi-Agent LLM Orchestration
Nickson Patel
cs.LG
Gym-Anything: Turn any Software into an Agent Environment
Pranjal Aggarwal, Graham Neubig, Sean Welleck
cs.LG
Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
Zijian Zhao, Jing Gao, Sen Li
cs.LG
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
Fei Tang et al.