This paper addresses the scalability challenges of long-context LLM inference that stem from the quadratic computational complexity of attention. It introduces Token-Grained Pipelining (TGP), which interleaves sparse attention computations at token granularity, hiding memory latency by overlapping asynchronous KV-cache fetches with attention compute. TGP comprises three components: a sparsity-aware token scheduler, a streaming KV cache pipeline, and a hierarchical attention engine. Theoretical analysis indicates TGP can achieve up to a 3.2x reduction in time-to-first-token (TTFT) latency.
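The paper's overlap of asynchronous KV-cache fetches with attention compute can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: `fetch_kv` is a hypothetical stand-in for an asynchronous cache read, blocks are synthetic, and a single prefetcher thread plus a bounded queue plays the role of the streaming pipeline. Per-block partial results are merged with an online-softmax accumulation (flash-attention style) so the streamed result exactly equals full attention over all blocks.

```python
import threading
import queue
import numpy as np

def fetch_kv(block_id, block_len=128, d=64):
    # Hypothetical stand-in for an asynchronous KV-cache read (e.g., from
    # host memory); synthetic data keyed by block_id keeps the sketch
    # self-contained.
    rng = np.random.default_rng(block_id)
    K = rng.standard_normal((block_len, d))
    V = rng.standard_normal((block_len, d))
    return K, V

def attend(q, K, V):
    # Reference scaled dot-product attention over a single KV span.
    s = q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()

def pipelined_attention(q, block_ids):
    # Double-buffered pipeline: a prefetcher thread fetches block i+1
    # while the main thread attends over block i (latency hiding).
    buf = queue.Queue(maxsize=1)

    def prefetcher():
        for b in block_ids:
            buf.put(fetch_kv(b))
        buf.put(None)  # sentinel: stream exhausted

    threading.Thread(target=prefetcher, daemon=True).start()

    # Online softmax accumulation so per-block partials combine into
    # exactly the softmax over the union of all blocks.
    m, denom, acc = -np.inf, 0.0, 0.0
    while (kv := buf.get()) is not None:
        K, V = kv
        s = q @ K.T / np.sqrt(K.shape[1])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        w = np.exp(s - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ V
        m = m_new
    return acc / denom
```

Because the combination rule renormalizes running statistics, the pipeline is free to consume KV blocks in stream order without ever materializing the full score vector, which is what makes the fetch/compute overlap safe.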
Key findings
Interleaves sparse attention computations at token granularity to hide memory latency behind compute.
Introduces a sparsity-aware token scheduler for dynamic workload partitioning.
Proposes a streaming KV cache pipeline that decouples memory fetches from attention computation.
Describes a hierarchical attention engine that exploits heterogeneous sparsity patterns across heads and layers.
Theoretical analysis shows up to a 3.2x reduction in time-to-first-token latency.
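The sparsity-aware scheduling and partitioning steps above can be sketched as follows. This is a hedged illustration only: the importance score (dot product between the query and each block's mean key), the `keep_ratio` parameter, and the round-robin partitioning are all assumptions standing in for the paper's unspecified criteria.

```python
import numpy as np

def select_sparse_blocks(q, block_keys, keep_ratio=0.25):
    # Cheap per-block importance proxy: score each KV block by the dot
    # product between the query and the block's mean key, then keep the
    # top fraction. The scoring rule and keep_ratio are assumptions.
    centroids = np.stack([K.mean(axis=0) for K in block_keys])
    scores = centroids @ q
    k = max(1, int(round(len(block_keys) * keep_ratio)))
    return np.argsort(scores)[::-1][:k]  # k highest-scoring block indices

def partition_blocks(block_ids, n_workers):
    # Round-robin partitioning of the surviving sparse blocks across
    # workers so pipeline stages receive balanced workloads.
    return [list(block_ids[i::n_workers]) for i in range(n_workers)]
```

A scheduler of this shape would run per query between the sparsity decision and the streaming fetch stage, feeding each worker only the KV blocks it is responsible for.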
Limitations & open questions
The evaluation plan is outlined but not yet executed; all reported gains are theoretical and require empirical validation.