arXiv 2604.26805v1 · Apr 29, 2026

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Bochao Liu et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Apr 29, 2026, 3:35 PM

Current score

86

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a \emph{unified operational paradigm} abstracting day-to-day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) \emph{Flexible Skill Arrangement}, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be automatically generated and updated by LLMs or iteratively refined through natural-language instructions from on-call engineers; (iii) a \emph{unified self-evolving mechanism} in which one correction signal drives two parallel pathways, case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, the major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves 99.0% pass rate on offline evaluations. Our code is available at https://github.com/benchen4395/BianQue_Assistant.
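The "Flexible Skill Arrangement" idea in the abstract can be pictured as a small registry that maps a (business-module, event-pattern) context to the evidence worth retrieving. The sketch below is illustrative only: the class and field names (`Skill`, `SkillRegistry`, `arrange`, the pattern strings) are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of a "Skill": each one names the data sources and
# knowledge entries to retrieve for a given business-module context, so
# the agent sees only relevant evidence instead of every signal at once.

@dataclass
class Skill:
    module: str               # business-module context, e.g. "ranking"
    event_pattern: str        # canonical pattern: "release", "inspection", "alert_rca"
    data_sources: list[str]   # metrics / logs / change-event queries to pull
    knowledge_refs: list[str] # handbook rules and practitioner notes to attach

class SkillRegistry:
    def __init__(self):
        self._skills: list[Skill] = []

    def register(self, skill: Skill) -> None:
        self._skills.append(skill)

    def arrange(self, module: str, event_pattern: str) -> list[Skill]:
        """Select only the Skills whose context matches the incoming
        operational event, rather than dumping all context into the LLM."""
        return [s for s in self._skills
                if s.module == module and s.event_pattern == event_pattern]

registry = SkillRegistry()
registry.register(Skill("ranking", "alert_rca",
                        data_sources=["latency_p99", "error_logs", "recent_deploys"],
                        knowledge_refs=["ranking_runbook#timeouts"]))

# Only the matching Skill's evidence is retrieved for this alert.
matched = registry.arrange("ranking", "alert_rca")
```

In the paper these mappings are generated and updated by LLMs or refined via natural-language instructions from on-call engineers; here the registry is static purely to show the selection step.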

Score 86 · Full-paper brief · agents · infra · data · inference

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

LLM operations agents usually fail less because they cannot reason and more because they are handed the wrong pile of metrics, logs, change events, and tribal knowledge. Bian Que is interesting because it turns that routing problem into an editable, self-updating operations layer, and the authors report production-scale results at Kuaishou: far fewer alerts, less pager noise, and faster diagnosis. If this generalizes, SRE, platform, and observability teams should treat agent orchestration and feedback loops as a real automation lever, not a demo feature; the caveat is that the evidence is still from one large search environment and does not prove autonomous remediation.

  • The paper’s strongest claim is that operations agents work when they know exactly which metrics, logs, change events, traces, and runbook knowledge to pull for each incident. That shifts the buying question from “which LLM?” to “how does the system map each operational event to the right evidence and keep that mapping current?”
  • The reported deployment is on Kuaishou’s e-commerce search engine at very large scale, with fired alerts cut to 25% of baseline and practitioner-facing non-actionable alerts cut to about 5% of baseline. If those numbers survive independent replication, the immediate business case is reduced pager load and faster incident triage, not full automation of ops.
  • A useful O&M agent needs a correction loop that updates both incident knowledge and retrieval rules without requiring engineers to rewrite code. The paper reports that without this feedback pathway, live-alert accuracy degraded quickly, which is exactly the failure mode buyers should test in pilots.
  • The implementation is not a fine-tuned giant model, but the authors still recommend roughly 35B-parameter-class backbones for robust online reasoning. For infrastructure teams, the practical question is whether a single-GPU, mid-sized model setup can meet latency, availability, and security requirements inside the existing observability stack.
  • An 80% root-cause accuracy rate is useful for triage but not enough to hand over remediation authority, and the reported setting is one large search platform rather than a broad cross-industry test. The paper’s own limitations point to unfinished pieces: closed-loop remediation, better routing than keyword matching, and multi-agent coordination for complex incidents.
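The correction loop flagged in the third bullet — one signal updating both incident knowledge and retrieval rules — can be sketched as a dual-pathway update. All names below (`apply_correction`, the dict keys) are assumptions for illustration, not the paper's API; in the real system an LLM performs the case-to-knowledge distillation that a string template stands in for here.

```python
# Minimal sketch of the "unified self-evolving mechanism": a single
# on-call correction drives two parallel updates -- distilling the
# corrected case into reusable knowledge, and refining the Skill that
# routed the evidence -- without engineers rewriting code.

def apply_correction(correction: dict,
                     knowledge_base: list[str],
                     skill: dict) -> None:
    # Pathway 1: case-memory-to-knowledge distillation. The paper uses an
    # LLM to summarize the corrected case; a template stands in here.
    note = f"{correction['symptom']} -> root cause: {correction['root_cause']}"
    if note not in knowledge_base:
        knowledge_base.append(note)

    # Pathway 2: targeted Skill refinement. Ensure the Skill now retrieves
    # the data source the engineer reported as missing.
    missing = correction.get("missing_data_source")
    if missing and missing not in skill["data_sources"]:
        skill["data_sources"].append(missing)

kb: list[str] = []
skill = {"module": "ranking", "data_sources": ["latency_p99"]}
apply_correction({"symptom": "p99 spike after deploy",
                  "root_cause": "config rollout",
                  "missing_data_source": "recent_deploys"},
                 kb, skill)
```

A pilot test, per the bullet above, would be to withhold this pathway and watch whether live-alert accuracy degrades as the monitored system changes.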

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability · high · p.8, p.9

Production deployment reported materially lower alert volume and faster resolution.

stack · high · p.5

The core mechanism is explicit selection of operational data and knowledge per event, rather than dumping all context into the LLM.

strategic · high · p.11

A feedback loop that updates knowledge and Skills appears necessary for production usefulness as systems change.

inference · medium · p.11

The framework does not require model fine-tuning, but does appear to require a reasonably capable model class for online use.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.LG

AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow

Jiale Liu, Nanzhe Wang

cs.LG

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA et al.

cs.SE

AIPC: Agent-Based Automation for AI Model Deployment with Qualcomm AI Runtime

Jianhao Su et al.

cs.IR

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.