ABSTRACT
This research proposes Adaptive Prefill Granularity (APG), a method that dynamically adjusts prefill decomposition boundaries in LLM inference systems to improve GPU resource sharing across heterogeneous workloads.
Key findings
APG dynamically adjusts decomposition boundaries based on real-time workload analysis.
Introduces three components: a workload-aware granularity selector, a boundary elasticity mechanism, and a heterogeneous-SLO scheduler.
Reduces tail latency by up to 45% and improves throughput by 28% compared to static chunked-prefill baselines.
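This summary does not describe the selector's actual policy, so the following is only a minimal sketch of what a workload-aware granularity selector could look like. It assumes a simple feedback rule: shrink the prefill chunk when decode latency exceeds its SLO, and grow it when there is headroom. All names (`WorkloadSnapshot`, `select_chunk_size`), the thresholds, and the multiplicative step are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSnapshot:
    """Hypothetical view of the current workload (not from the paper)."""
    active_decode_requests: int  # requests currently in the decode phase
    decode_slo_ms: float         # per-token decode latency target
    observed_decode_ms: float    # recently measured per-token decode latency


def select_chunk_size(
    snapshot: WorkloadSnapshot,
    current_chunk: int,
    min_chunk: int = 128,
    max_chunk: int = 4096,
    step: float = 2.0,
    headroom: float = 0.8,
) -> int:
    """Return the next prefill chunk size in tokens (assumed feedback rule)."""
    # No decode requests to protect: use the largest chunk for throughput.
    if snapshot.active_decode_requests == 0:
        return max_chunk

    # Decode latency violates the SLO: shrink the chunk so prefill work
    # interferes less with co-scheduled decode steps.
    if snapshot.observed_decode_ms > snapshot.decode_slo_ms:
        return max(min_chunk, int(current_chunk / step))

    # Latency is comfortably below the SLO: grow the chunk.
    if snapshot.observed_decode_ms < headroom * snapshot.decode_slo_ms:
        return min(max_chunk, int(current_chunk * step))

    # Within the band between headroom and the SLO: keep the current size.
    return current_chunk


if __name__ == "__main__":
    snap = WorkloadSnapshot(
        active_decode_requests=8, decode_slo_ms=50.0, observed_decode_ms=60.0
    )
    print(select_chunk_size(snap, current_chunk=1024))
```

A multiplicative step is used here only because it reacts quickly in both directions; a real system would likely need smoothing of the latency signal to avoid oscillating between chunk sizes.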
Limitations & open questions
The approach may require further optimization for rapidly shifting workloads.
The effectiveness of APG in different production environments needs further validation.