This paper addresses the challenge of managing key-value (KV) caches in large language model (LLM) serving systems across diverse workloads. AdaptCache, an adaptive cache policy framework, dynamically adjusts its caching strategy based on real-time workload characteristics, improving cache hit rates, reducing latency, and increasing throughput.
Key findings
AdaptCache improves cache hit rates by up to 2.3x, reduces p99 latency by 47%, and increases throughput by 1.8x compared to state-of-the-art policies.
The framework introduces a workload-aware policy engine that predicts cache utility and selects optimal strategies for different workload segments.
A lightweight reinforcement learning-based policy selector achieves near-optimal performance with minimal overhead.
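To make the idea of a learned policy selector concrete, here is a minimal sketch of one simple form such a selector could take: an epsilon-greedy bandit that keeps a running reward estimate for each (workload segment, cache policy) pair. This is not AdaptCache's actual algorithm, which the summary does not specify. The class name, the policy names ("lru", "lfu"), the segment names ("chat", "batch"), and the use of a 0/1 cache-hit signal as the reward are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class EpsilonGreedyPolicySelector:
    """Hypothetical per-segment bandit over cache policies (illustrative only)."""

    def __init__(self, policies, epsilon=0.1, seed=0):
        self.policies = list(policies)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Per (segment, policy): observation count and running mean reward.
        self.counts = defaultdict(int)
        self.values = defaultdict(float)

    def best(self, segment):
        # Policy with the highest estimated reward; ties go to the first listed.
        return max(self.policies, key=lambda p: self.values[(segment, p)])

    def select(self, segment):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.policies)
        return self.best(segment)

    def update(self, segment, policy, reward):
        # Incremental mean: Q <- Q + (r - Q) / n.
        key = (segment, policy)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


if __name__ == "__main__":
    selector = EpsilonGreedyPolicySelector(["lru", "lfu"], epsilon=0.1)
    # Assumed reward: 1.0 for a cache hit, 0.0 for a miss.
    selector.update("chat", "lru", 1.0)
    selector.update("chat", "lru", 0.0)
    selector.update("chat", "lfu", 1.0)
    print(selector.best("chat"))
```

The per-update cost is a dictionary lookup and one arithmetic step, which is the sense in which a selector like this can be lightweight; a production system would likely condition on richer workload features than a segment label.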
Limitations & open questions
The study focuses on production-like traces; further real-world validation may be required.
The framework's scalability and performance in other types of heterogeneous environments need to be explored.