arXiv 2603.11896v1 · Mar 12, 2026

Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang et al.

Brief context

Publication timing, weekly edition context, and source links for this brief.

Published

Mar 12, 2026, 1:13 PM

Current score

75

Original paper

The executive brief below is grounded in the source paper and linked back to the arXiv abstract.

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we introduce an efficient pipeline that overlaps watching and thinking and adaptively selects the best attention backend. Under both single-round and multi-round streaming input protocols, our method achieves strong results. Built on Qwen3-VL, it improves single-round accuracy by 2.6% on StreamingBench and by 3.79% on OVO-Bench. In the multi-round setting, it maintains performance while reducing output tokens by 56%. Code is available at: https://github.com/wl666hhh/Think_While_Watching/

Score 75 · PDF-backed · inference · models · training · infra

Executive brief

A short business-reader brief that explains why the paper matters now and what to watch or do next.

Why this is worth your attention

This paper matters because it attacks a practical bottleneck in live video AI: most multimodal models still work best when they can see the whole video first, which is a bad fit for surveillance, operations monitoring, customer support, robotics, and any workflow that needs answers while footage is still arriving. The claimed shift is not a giant raw-accuracy jump, but a more deployable operating mode: keep watching while answering, preserve useful memory across turns, and cut multi-turn output tokens by 56% without losing performance. If that holds in production, streaming video copilots get cheaper and more responsive to run; what remains uncertain is how much of the latency story survives outside the authors’ Qwen3-VL setup and benchmark-heavy evaluation.

  • The paper shows that naive online use of a strong video model can collapse badly, while streaming-aligned training recovers performance. For teams evaluating live video AI, the question is no longer just model quality; it is whether the model was actually trained and engineered for streaming interaction rather than repurposed from offline video QA.
  • The most interesting gain here is operational, not dramatic benchmark dominance: 56% fewer output tokens in multi-round use, plus a pipeline that overlaps watching and answering. If a vendor claims live-video cost or latency advantages, ask whether those come from memory compression and inference scheduling like this, or just from using a smaller model or shorter prompts.
  • The architecture writes one compact memory note per video segment, and that memory appears to matter: removing it drops multi-turn accuracy from 57.40% to 52.35% on StreamingBench. A real adoption signal would be video AI products exposing persistent session memory, event summaries, and replayable evidence trails instead of forcing every follow-up question to reprocess raw footage.
  • The paper reports a 92.6% TTFT reduction versus batch processing, but that TTFT is measured in tokens, not seconds, and the authors also acknowledge residual backlog from scheduling and cache overheads. That makes this a credible systems direction, not yet hard proof of production responsiveness under real camera streams and enterprise SLAs.
  • This design introduces an explicit operating knob: longer video segments reduce decoding tokens but also reduce accuracy, while shorter segments preserve accuracy at higher token cost. For operators, that means streaming video systems may become tuneable like search and retrieval systems, with segment size and memory granularity becoming real levers for cost, speed, and answer quality.
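The segment-memory idea and the segment-size knob in the bullets above can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: `MemoryNote`, `summarize_segment`, and `stream_to_notes` are hypothetical names, and the real system writes model-generated notes over visual features rather than placeholder strings. The point is the tradeoff: larger `segment_seconds` means fewer notes (fewer decoded tokens) but coarser memory.

```python
from dataclasses import dataclass

@dataclass
class MemoryNote:
    segment_index: int  # which chunk of the stream this note covers
    start_s: float      # segment start time in the stream (seconds)
    end_s: float        # segment end time (seconds)
    summary: str        # compact textual note the model can re-read later

def summarize_segment(frames: list) -> str:
    """Stand-in for the model writing one compact note about a segment."""
    return f"{len(frames)} frames: <model-written summary>"

def stream_to_notes(frames: list, fps: float, segment_seconds: float) -> list:
    """Chunk an arriving frame stream into fixed-length segments, keeping
    one note per segment. Larger segment_seconds -> fewer notes (cheaper),
    smaller segment_seconds -> finer-grained memory (more accurate)."""
    per_segment = max(1, int(fps * segment_seconds))
    notes = []
    for i in range(0, len(frames), per_segment):
        chunk = frames[i:i + per_segment]
        notes.append(MemoryNote(
            segment_index=len(notes),
            start_s=i / fps,
            end_s=(i + len(chunk)) / fps,
            summary=summarize_segment(chunk),
        ))
    return notes

# 60 s of 1 fps frames: 2 s segments -> 30 notes; 10 s segments -> 6 notes.
frames = list(range(60))
print(len(stream_to_notes(frames, fps=1.0, segment_seconds=2.0)))   # 30
print(len(stream_to_notes(frames, fps=1.0, segment_seconds=10.0)))  # 6
```

Follow-up questions then read the short note list instead of reprocessing raw footage, which is where the reported multi-round token savings would come from.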

Evidence ledger

capability · high · p.1

TWW improves single-round streaming accuracy on Qwen3-VL benchmarks.

inference · high · p.1

TWW preserves multi-round performance while reducing output tokens by 56%.

stack · high · p.7

The method decouples visual ingestion from text decoding with a dual KV cache pipeline.

capability · high · p.12

Persistent memory notes materially contribute to multi-turn accuracy.

caveat · medium · p.12, p.22

Real-world responsiveness remains uncertain because latency is not reported in wall-clock terms and practical overheads remain.
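The "stack" claim in the ledger, decoupling visual ingestion from text decoding so watching and thinking overlap, can be sketched with two threads and a shared cache. This is a hypothetical toy, not the paper's dual KV cache pipeline: `perception_cache`, `watcher`, and `answer` are invented names, and real ingestion appends transformer KV states rather than strings.

```python
import queue
import threading

perception_cache = []        # stands in for the visual-side KV cache
cache_lock = threading.Lock()
segments = queue.Queue()     # video segments arriving from the stream

def watcher():
    """Ingestion thread: keeps consuming arriving segments and appending
    their features to the cache, independent of any question answering."""
    while True:
        seg = segments.get()
        if seg is None:      # end-of-stream sentinel
            break
        with cache_lock:
            perception_cache.append(f"features({seg})")

def answer(question: str) -> str:
    """Decoding path: answers against whatever the cache holds right now,
    standing in for the text-side KV cache. It never blocks ingestion."""
    with cache_lock:
        seen = len(perception_cache)
    return f"{question} -> answered over {seen} cached segment(s)"

t = threading.Thread(target=watcher)
t.start()
for s in ["seg0", "seg1", "seg2"]:
    segments.put(s)                      # the stream keeps arriving...
print(answer("What happened so far?"))   # ...while we answer mid-stream
segments.put(None)
t.join()
print(answer("Final recap?"))
```

In an interleaved perception-generation design, by contrast, ingestion would pause while an answer is being generated, which is the backlog the paper's pipeline is meant to avoid.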

Related briefs

More plain-English summaries from the archive with nearby topics or operator relevance.

cs.CV

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han et al.

cs.CV

COMIC: Agentic Sketch Comedy Generation

Susung Hong et al.

Thank you to arXiv for use of its open access interoperability. This product was not reviewed or approved by, nor does it necessarily express or reflect the policies or opinions of, arXiv.