This paper addresses the scalability challenges of long-context LLM inference that stem from the quadratic computational complexity of attention. It introduces Token-Grained Pipelining (TGP), which interleaves sparse attention computations at token granularity, hiding memory latency by overlapping asynchronous KV-cache fetches with attention compute. TGP comprises three components: a sparsity-aware token scheduler, a streaming KV cache pipeline, and a hierarchical attention engine. Theoretical analysis indicates TGP can achieve up to a 3.2x reduction in time-to-first-token (TTFT) latency.
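The paper's overlap of asynchronous KV-cache fetches with attention compute can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's implementation: `fetch_kv` is a hypothetical stand-in for an asynchronous cache read, blocks are synthetic, and a single prefetcher thread plus a bounded queue plays the role of the streaming pipeline. Per-block partial results are merged with an online-softmax accumulation (flash-attention style) so the streamed result exactly equals full attention over all blocks.

```python
import threading
import queue
import numpy as np

def fetch_kv(block_id, block_len=128, d=64):
    # Hypothetical stand-in for an asynchronous KV-cache read (e.g., from
    # host memory); synthetic data keyed by block_id keeps the sketch
    # self-contained.
    rng = np.random.default_rng(block_id)
    K = rng.standard_normal((block_len, d))
    V = rng.standard_normal((block_len, d))
    return K, V

def attend(q, K, V):
    # Reference scaled dot-product attention over a single KV span.
    s = q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()

def pipelined_attention(q, block_ids):
    # Double-buffered pipeline: a prefetcher thread fetches block i+1
    # while the main thread attends over block i (latency hiding).
    buf = queue.Queue(maxsize=1)

    def prefetcher():
        for b in block_ids:
            buf.put(fetch_kv(b))
        buf.put(None)  # sentinel: stream exhausted

    threading.Thread(target=prefetcher, daemon=True).start()

    # Online softmax accumulation so per-block partials combine into
    # exactly the softmax over the union of all blocks.
    m, denom, acc = -np.inf, 0.0, 0.0
    while (kv := buf.get()) is not None:
        K, V = kv
        s = q @ K.T / np.sqrt(K.shape[1])
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        w = np.exp(s - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ V
        m = m_new
    return acc / denom
```

Because the combination rule renormalizes running statistics, the pipeline is free to consume KV blocks in stream order without ever materializing the full score vector, which is what makes the fetch/compute overlap safe.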
Key findings
Interleaves sparse attention computations at token granularity to hide memory latency behind compute.
Introduces a sparsity-aware token scheduler for dynamic workload partitioning.
Proposes a streaming KV cache pipeline that decouples memory fetches from attention computation.
Describes a hierarchical attention engine that exploits heterogeneous sparsity patterns across heads and layers.
Theoretical analysis shows up to a 3.2x reduction in time-to-first-token latency.
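The sparsity-aware scheduling and partitioning steps above can be sketched as follows. This is a hedged illustration only: the importance score (dot product between the query and each block's mean key), the `keep_ratio` parameter, and the round-robin partitioning are all assumptions standing in for the paper's unspecified criteria.

```python
import numpy as np

def select_sparse_blocks(q, block_keys, keep_ratio=0.25):
    # Cheap per-block importance proxy: score each KV block by the dot
    # product between the query and the block's mean key, then keep the
    # top fraction. The scoring rule and keep_ratio are assumptions.
    centroids = np.stack([K.mean(axis=0) for K in block_keys])
    scores = centroids @ q
    k = max(1, int(round(len(block_keys) * keep_ratio)))
    return np.argsort(scores)[::-1][:k]  # k highest-scoring block indices

def partition_blocks(block_ids, n_workers):
    # Round-robin partitioning of the surviving sparse blocks across
    # workers so pipeline stages receive balanced workloads.
    return [list(block_ids[i::n_workers]) for i in range(n_workers)]
```

A scheduler of this shape would run per query between the sparsity decision and the streaming fetch stage, feeding each worker only the KV blocks it is responsible for.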
Limitations & open questions
The evaluation plan is outlined but not yet executed; all reported gains are theoretical and require empirical validation.