Brief context
Publication timing, weekly edition context, and source links for this brief.
Original paper
The executive brief below is grounded in the source paper and linked back to the arXiv abstract.
AI agents that execute tasks via tool calls frequently hallucinate results: fabricating tool executions, misstating output counts, or presenting inferences as facts. Recent approaches to verifiable AI inference rely on zero-knowledge proofs, which provide cryptographic guarantees but impose minutes of proving time per query, making them impractical for interactive agents. We propose NabaOS, a lightweight verification framework inspired by Indian epistemology (Nyaya Shastra), which classifies every claim in an LLM response by its epistemic source (pramana): direct tool output (pratyaksha), inference (anumana), external testimony (shabda), absence (abhava), or ungrounded opinion. Our runtime generates HMAC-signed tool execution receipts that the LLM cannot forge, then cross-references claims against these receipts to detect hallucinations in real time. We evaluate on NyayaVerifyBench, a new benchmark of 1,800 agent response scenarios across four languages with injected hallucinations of six types. NabaOS detects 94.2% of fabricated tool references, 87.6% of count misstatements, and 91.3% of false absence claims, with <15ms verification overhead per response. For deep delegation (agents performing multi-step web tasks), our cross-checking protocol catches 78.4% of URL fabrications via independent re-fetching. We compare against five approaches: zkLLM (cryptographic proofs, 180s/query), TOPLOC (locality-sensitive hashing), SPEX (sampling-based proof of execution), tensor commitments, and self-consistency checking. NabaOS achieves the best cost-latency-coverage trade-off for interactive agents: 94.2% coverage at <15ms versus zkLLM's near-perfect coverage at 180,000ms. For interactive agents, practical receipt-based verification provides better cost-benefit than cryptographic proofs, and epistemic classification gives users actionable trust signals rather than binary judgments.
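The receipt mechanism the abstract describes can be sketched in a few lines: the runtime holds a secret key the model never sees, signs each tool execution, and later rejects any receipt it did not issue. This is a minimal illustration under assumed field names (`tool`, `output_sha256`), not the paper's actual API or receipt schema.

```python
import hashlib
import hmac
import json

# Held by the runtime only; the LLM never sees this key, so it cannot
# forge a valid receipt for a tool call that never happened.
SECRET_KEY = b"runtime-only-secret"

def sign_receipt(tool_name: str, output: str) -> dict:
    """Issue a tamper-evident receipt for one tool execution (illustrative)."""
    payload = json.dumps(
        {"tool": tool_name,
         "output_sha256": hashlib.sha256(output.encode()).hexdigest()},
        sort_keys=True,
    )
    sig = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "sig": sig}

def verify_receipt(receipt: dict) -> bool:
    """Accept only receipts the runtime actually issued."""
    expected = hmac.new(SECRET_KEY, receipt["payload"].encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(expected, receipt["sig"])
```

A forged receipt with a tampered payload fails verification because the signature no longer matches, which is the property that lets the runtime catch fabricated tool references cheaply.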
Executive brief
A short business-reader brief that explains why the paper matters now and what to watch or do next.
Why this is worth your attention
If this holds up, a meaningful chunk of agent reliability stops being a hard cryptography problem and becomes an engineering discipline: instrument every tool call, issue tamper-resistant receipts, and verify what the agent says before it reaches the user. That matters because it makes real-time hallucination checking practical for customer-facing and employee-facing agents, with the paper reporting detection rates of roughly 88-94% across hallucination types at under 15 ms of overhead instead of minutes-long proof systems. The likely implication is pressure on agent platforms, workflow vendors, and internal AI teams to compete on auditability and grounded outputs, not just model quality, though this is benchmark evidence on a new dataset, not proof that every production agent stack will get the same protection.
- If your team has treated agent hallucination control as a model-selection problem, this paper argues a lot of the fix sits in orchestration: signed tool receipts, deterministic checks, and trust labels at the runtime layer. That shifts attention toward platform, security, and workflow engineering rather than waiting for a better base model.
- A useful buying question is whether the agent platform can show tool-by-tool receipts, output hashes, result counts, and which claims were directly grounded versus inferred. The paper's advantage comes from that instrumentation, so vendors that cannot surface it may be giving you confidence theater rather than verification.
- The strongest story is for agents whose tool calls run inside a controlled runtime. Once agents roam the web or delegate across steps, the system falls back to cross-checking like URL re-fetching, which is slower and less complete: the paper reports 78.4% detection of URL fabrications with 200-500 ms of added latency.
- The practical win is not just catching bad outputs, but routing work differently based on confidence—auto-send 'Fully Verified' results downstream, escalate weaker ones to review, or block actions on ungrounded claims. The calibration result is promising, but it still comes from the authors' benchmark rather than live production traffic.
- This is a practical guardrail, not a full proof system. It depends on the runtime keeping the HMAC key secret and on the model cooperating with self-tagging often enough to preserve granularity, and it does not tell you whether the underlying tool or source was itself wrong.
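The routing idea in the bullets above (auto-send verified results, escalate weak ones, block ungrounded claims) can be sketched as a simple check of an agent's claimed result count against the runtime's recorded receipts. The field name `result_count` and the three-way policy are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical routing sketch: cross-check a claimed count against recorded
# tool-execution receipts, then route the response by verification outcome.

def route_response(claimed_count: int, receipts: list[dict]) -> str:
    """Return a routing decision from simple receipt cross-checks."""
    if not receipts:
        # No tool execution on record: the claim is ungrounded.
        return "block"
    actual = sum(r.get("result_count", 0) for r in receipts)
    if claimed_count == actual:
        # Claim matches the receipts: safe to pass downstream.
        return "auto-send"
    # Mismatch between claim and receipts: send to human review.
    return "escalate"
```

The point of the sketch is that the decision is deterministic and cheap, because it compares a claim against signed runtime records rather than re-running the model.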
Evidence ledger
- Receipt-based verification can catch a large share of agent hallucinations with low latency overhead in the authors' benchmark.
- The framework is especially strong at detecting fabricated tool references, misstated counts, and false absence claims.
- The reported business-relevant advantage over zero-knowledge approaches is speed and deployability, not stronger formal guarantees.
- Trust labels may be operationally useful because 'Fully Verified' outputs are highly calibrated in the benchmark.
- The main caveat is scope: the system verifies claims against recorded tool outputs, not the truth of those outputs or every external-source claim.
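The ledger's point about trust labels can be made concrete with a toy mapping from the paper's pramana categories to user-facing labels. The category names come from the abstract; the specific label strings and the one-to-one mapping are assumed for illustration, not the authors' scheme.

```python
from enum import Enum

class Pramana(Enum):
    """Epistemic sources from the paper's Nyaya-inspired classification."""
    PRATYAKSHA = "direct tool output"
    ANUMANA = "inference"
    SHABDA = "external testimony"
    ABHAVA = "absence"
    OPINION = "ungrounded opinion"

# Illustrative user-facing trust labels; the exact labels and policy
# are assumptions, not taken from the paper.
TRUST_LABEL = {
    Pramana.PRATYAKSHA: "Fully Verified",
    Pramana.ANUMANA: "Inferred",
    Pramana.SHABDA: "Source-Attributed",
    Pramana.ABHAVA: "Verified Absence",
    Pramana.OPINION: "Unverified",
}
```

A mapping like this is what turns binary pass/fail verification into the graded, actionable trust signal the brief highlights.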
Related briefs
More plain-English summaries from the archive with nearby topics or operator relevance.
cs.CR
SplitAgent: A Privacy-Preserving Distributed Architecture for Enterprise-Cloud Agent Collaboration
Jianshu She