ABSTRACT
This paper presents StreamingLookahead, a novel framework that extends LookaheadKV to handle streaming long-context scenarios with online cache management. It introduces a Streaming Importance Predictor, an Online Eviction Scheduler, and a Future-Aware Importance Propagation mechanism, enabling efficient processing of unbounded context streams with a fixed-size KV cache.
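The abstract's core idea — a fixed-size KV cache whose entries are evicted online by predicted importance — can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the names `FixedSizeKVCache` and `importance` are assumptions standing in for the Streaming Importance Predictor and Online Eviction Scheduler, whose actual mechanics are not given here.

```python
class FixedSizeKVCache:
    """Toy fixed-size KV cache with importance-based online eviction.

    Hypothetical sketch: `importance` stands in for a score from a
    predictor like the paper's Streaming Importance Predictor.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # token position -> (importance, kv pair)

    def add(self, position, kv, importance):
        """Insert a token's KV entry; evict if over capacity."""
        self.entries[position] = (importance, kv)
        if len(self.entries) > self.capacity:
            # Online eviction: drop the lowest-importance entry so the
            # cache never exceeds its fixed budget.
            victim = min(self.entries, key=lambda p: self.entries[p][0])
            del self.entries[victim]

    def positions(self):
        """Positions currently retained, in order."""
        return sorted(self.entries)


cache = FixedSizeKVCache(capacity=4)
stream = [(0, 5.0), (1, 1.0), (2, 4.0), (3, 2.0), (4, 3.0), (5, 6.0)]
for pos, imp in stream:
    cache.add(pos, f"kv{pos}", imp)
print(cache.positions())  # low-importance positions 1 and 3 were evicted
```

Because eviction happens per token against a fixed budget, the cache size — and hence per-token attention cost — stays constant regardless of stream length, which is the property behind the constant-latency claim below.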
Key findings
StreamingLookahead achieves 94.5% of full-cache performance with only 5% of the cache size.
Outperforms StreamingLLM by 18.3% and H2O by 12.7% on streaming tasks.
Maintains constant per-token latency, achieving up to 47× speedup over full-cache baselines at 1M token contexts.
Limitations & open questions
The framework is not evaluated under extremely high token arrival rates, so it is unclear whether the Online Eviction Scheduler keeps pace when tokens arrive faster than importance scores can be predicted.