arXiv 2605.09863v1, May 11, 2026

Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

Chunxiao Wang

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published: May 11, 2026, 1:49 AM

Current score: 85

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Production LLM coding agents drift over long sessions: they forget user-specified constraints, slip into mistakes the user already flagged, and confabulate prior agreements. White-box approaches such as persona vectors require model weights and so cannot be applied to closed APIs (Claude, GPT-4) that most users actually interact with. We present Nautilus Compass, a black-box persona drift detector and agent memory layer for production coding agents. The method operates entirely at the prompt-text layer: cosine similarity between user prompts and behavioral anchor texts, aggregated by a weighted top-k mean using BGE-m3 embeddings. Compass is, to our knowledge, the only public agent memory layer (among Mem0, Letta, Cognee, Zep, MemOS, smrti verified May 2026) that does not call an LLM at index time to extract facts or build a graph; raw conversation text is embedded directly. The system ships as a Claude Code plugin, an MCP 2024-11-05 A2A server (Cursor, Cline, Hermes), a CLI, and a REST API on one daemon, with a Merkle-chained audit log for tamper-evident anchor updates. On a held-out test set built from real Claude Code session traces and labeled by an independent LLM judge, Compass reaches ROC AUC 0.83 for drift detection. The embedded retrieval pipeline scores 56.6% on LongMemEval-S v0.8 and 44.4% on EverMemBench-Dynamic (n=500), topping the four published EverMemBench Table 4 baselines. LongMemEval-S 56.6% is ~30 points below recent white-box leaders (90+%); we treat that as the architectural ceiling of the no-extraction design. End-to-end reproduction cost is $3.50 (~14x cheaper than GPT-4o-judged stacks). A paired cross-vendor behavior A/B accompanies these numbers as preliminary system-level evidence. Code, anchors, frozen test data, and audit-log tooling are MIT-licensed at github.com/chunxiaoxx/nautilus-compass.
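The scoring step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes prompt and anchor embeddings have already been computed (the paper uses BGE-m3; here they are plain lists of floats), and the geometric rank weights, the `decay` parameter, and the function names are hypothetical, since the abstract says only that the top-k similarities are combined by a weighted mean.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_topk_mean(sims: Sequence[float], k: int = 3, decay: float = 0.5) -> float:
    """Weighted mean of the k largest similarities.

    ASSUMPTION: weights decay geometrically with rank (1, decay, decay^2, ...).
    The source does not state the weighting scheme.
    """
    top = sorted(sims, reverse=True)[:k]
    if not top:
        return 0.0
    weights = [decay ** rank for rank in range(len(top))]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)


def anchor_score(
    prompt_vec: Sequence[float],
    anchor_vecs: List[Sequence[float]],
    k: int = 3,
    decay: float = 0.5,
) -> float:
    """Score one prompt embedding against a set of behavioral-anchor embeddings."""
    sims = [cosine(prompt_vec, anchor) for anchor in anchor_vecs]
    return weighted_topk_mean(sims, k=k, decay=decay)
```

How the resulting score maps to a drift alert (direction and threshold) is not specified in the abstract and would have to be calibrated on labeled sessions.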

Score 85 | Full-paper brief | agents, data, inference, infra

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

Long-running LLM agents fail in a very operational way: they forget constraints, repeat corrected mistakes, and invent agreements from earlier context. This paper’s bet is that enterprises do not need model weights or expensive LLM-based memory extraction to catch some of that drift; a cheap embedding-and-anchor layer around closed coding agents may be enough to create alerts, recall prior instructions, and leave an audit trail. The evidence is encouraging for coding-agent workflows, but it is not yet proof that alerts reliably improve behavior across domains or vendors.
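The abstract describes the audit trail as a Merkle-chained log for tamper-evident anchor updates. One simple way to get tamper evidence is a linear hash chain, sketched below under stated assumptions: the entry layout, the `GENESIS` value, and the function names are hypothetical, and the actual Compass log format may differ.

```python
import hashlib
import json
from typing import Any, Dict, List

# ASSUMPTION: a fixed all-zero value stands in for the hash before the first entry.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: Dict[str, Any]) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(
        {"prev": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append an entry whose hash commits to the entry before it."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash in order; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Because each hash covers the previous one, changing an early anchor update invalidates every later entry, so an auditor only needs to trust the most recent hash.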

  • The useful shift is not a better coding model; it is a lightweight control layer that can sit around closed models such as Claude or GPT-4 without access to weights. If this works, enterprises can add drift detection, memory recall, and auditability to agent workflows without waiting for the model vendor to expose internals.
  • Do not treat this as a plug-and-play safety module. The paper’s own ablations suggest the biggest performance gain came from rewriting behavioral anchors into task-shaped language, so deployment in legal, finance, medical, or engineering workflows would need domain owners to define what “drift” actually means.
  • The strongest clean headline is 0.83 ROC AUC on held-out Claude Code traces labeled by an LLM judge; the higher 0.9232 AUC is explicitly not a generalization number because hard false positives were folded back into the anchors. Buyers should ask for evaluations on their own session logs before trusting alert thresholds.
  • The A/B evidence says the alert improved fabrication resistance, but did not materially move verification, destructive-action refusal, or secret-handling on aggregate. The next meaningful signal is whether drift alerts reduce real rework, repeated mistakes, or unsafe tool use over long production sessions, not just whether they classify prompts well.
  • For agent-platform vendors, the practical question is whether memory and drift checks run locally, cheaply, and fast enough for interactive coding. Compass reports a warm path around 1.8 seconds and a very low reproduction cost, but also shows that heavier reranking can add large latency for negligible drift-detection gain.
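For readers evaluating on their own logs, ROC AUC can be computed directly from judge labels and detector scores: it is the probability that a randomly chosen drifted example scores higher than a randomly chosen non-drifted one, with ties counted as half. The sketch below assumes labels are 1 (drift) or 0 (no drift) and that higher scores mean more drift; the function name and the toy data in the note after it are illustrative, not taken from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores.

    Quadratic in the number of examples, which is fine for small labeled sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))
```

For labels [1, 1, 0, 0] and scores [0.9, 0.4, 0.5, 0.1], three of the four positive-negative pairs are ordered correctly, giving 0.75. AUC summarizes ranking quality only; an alert threshold still has to be chosen separately from the false-positive rate a team can tolerate.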

Evidence ledger

The strongest claims in the brief, along with the confidence and citation depth behind them.

capability | high | p.1

Compass reports 0.83 ROC AUC for persona-drift detection on held-out Claude Code session traces labeled by an independent LLM judge.

stack | high | p.1, p.1

Compass is a black-box, prompt-layer system that embeds raw conversation text directly and does not require model weights or an LLM call at index time.

capability | medium | p.1, p.1

The memory-retrieval pipeline is competitive among black-box/no-extraction approaches but remains well below recent white-box leaders on LongMemEval-S.

training | high | p.13

Anchor design appears to be the dominant determinant of drift-detection performance, outweighing embedder changes in the authors’ ablation.

caveat | medium | p.16, p.16

Cross-vendor behavior steering showed a statistically significant improvement only on fabrication resistance, with no meaningful aggregate improvement on the other tested axes.

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

  • cs.CR: The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems (Yihao Zhang et al.)
  • cs.CR: MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic (Sultan Zavrak)
  • cs.AI: Policy-Invisible Violations in LLM-Based Agents (Jie Wu, Ming Gong)
  • cs.CR: Tool Receipts, Not Zero-Knowledge Proofs: Practical Hallucination Detection for AI Agents (Abhinaba Basu)

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.