arXiv 2604.08455v1 · Apr 9, 2026

KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Tongbo Chen et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Apr 9, 2026, 4:50 PM

Current score

79

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
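The hybrid evaluation protocol described above (rule-based verification combined with LLM-as-a-Judge scoring) can be illustrated with a minimal sketch. All names, the consent heuristic, and the 0.5/0.5 weighting below are illustrative assumptions, not KnowU-Bench's actual implementation.

```python
# Hypothetical sketch of a hybrid evaluation step: a deterministic
# rule-based check gates task completion, and an LLM-as-a-Judge score
# (stubbed here with a heuristic) rates soft qualities such as consent
# handling. Names and weights are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class EpisodeResult:
    ui_state: dict          # final emulator UI state
    transcript: list        # agent-user dialogue turns (strings)

def rule_based_pass(result: EpisodeResult, expected: dict) -> bool:
    # Deterministic verification: does the final UI state match the goal?
    return all(result.ui_state.get(k) == v for k, v in expected.items())

def judge_score(result: EpisodeResult) -> float:
    # Placeholder for an LLM-as-a-Judge call scoring consent handling
    # and restraint on a 0-1 scale; stubbed with a trivial heuristic.
    asked_consent = any("may i" in turn.lower() for turn in result.transcript)
    return 1.0 if asked_consent else 0.0

def hybrid_score(result: EpisodeResult, expected: dict) -> float:
    # The rule check gates the score; the judge score refines it.
    if not rule_based_pass(result, expected):
        return 0.0
    return 0.5 + 0.5 * judge_score(result)

episode = EpisodeResult(
    ui_state={"order_placed": True, "size": "medium"},
    transcript=["Agent: May I place the order with your usual size?",
                "User: Yes, go ahead."],
)
print(hybrid_score(episode, {"order_placed": True, "size": "medium"}))  # 1.0
```

The gating design mirrors the paper's framing: grounded GUI execution is a hard pass/fail prerequisite, while consent negotiation and restraint are softer qualities better judged by a model.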

Score 79 · Full-paper brief · agents · inference · infra · data

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

Most mobile-agent demos still test whether a model can tap the right buttons; this benchmark tests the harder commercial question: can it figure out what a specific user wants, decide whether to step in, and stop when told no. The paper’s main result is sobering but useful: today’s strongest models are decent at explicit app navigation, yet performance drops sharply once work depends on preference inference or calibrated proactivity, with even the best overall model reaching 60.4% success and frontier systems falling below 50% on vague instructions. If that holds up, the near-term bottleneck for consumer assistants, enterprise copilot workflows, and device makers is not better GUI control alone but better memory, consent, and intervention policy.

  • If you are evaluating mobile or desktop agents, assume that reliable clicking is no longer the main differentiator. This benchmark suggests the bigger commercial gap is whether the system can infer preferences from messy history, ask clarifying questions efficiently, and handle consent without becoming either intrusive or passive.
  • A vendor claiming "agentic" readiness should be able to explain when the agent asks first, when it acts autonomously, how it handles rejection, and how it avoids unwanted interventions. In this paper, 80% of proactive failures came from bad timing or bad initiative calibration rather than downstream GUI execution.
  • The practical race may shift toward systems that can learn from behavioral logs without exposing a static user profile, because that is closer to how a real assistant would have to operate in production. Teams building assistants, commerce automation, or device experiences should look for products that can recover preferences from noisy histories and improve through clarification, not just from hard-coded settings.
  • The headline model result is respectable but not deployment-grade for sensitive autonomy: the best overall success rate is 60.4%, and the paper says frontier models can fall below 50% on vague instructions. That is strong evidence of a real bottleneck, but only medium evidence for exact production failure rates because the study uses a reproducible emulator and a GPT-4o-based user simulator rather than live users.

Affiliations

Institution names extracted from the brief's PDF summary call.

Zhejiang University

Author marker 1

From PDF summary

Apple

Author marker 2

From PDF summary

Tencent

Author marker 3

From PDF summary

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

stack · high · p.3, p.3

KnowU-Bench evaluates personalized and proactive mobile-agent behavior in a reproducible Android emulator rather than offline logs alone.

capability · high · p.1, p.6

The benchmark hides user profiles and forces models to infer preferences from behavior and dialogue, which is closer to real assistant deployment conditions.

capability · high · p.2, p.1

Current frontier models still show a large performance drop when tasks require personalization or calibrated proactivity.

strategic · high · p.10, p.3

The main operational bottleneck appears to be intervention calibration and preference elicitation, not basic GUI navigation.

caveat · medium · p.6, p.3

External validity remains uncertain because the benchmark relies on simulation and emulator instrumentation.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.LG

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Fei Tang et al.

cs.LG

Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Zijian Zhao, Jing Gao, Sen Li

cs.AI

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma et al.

cs.AI

Don't Overthink It: Inter-Rollout Action Agreement as a Free Adaptive-Compute Signal for LLM Agents

Khushal Sethi

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.