ABSTRACT
LLMServingSim 3.0 extends profile-based performance modeling to capture multi-tenant workload interference in GPU clusters that serve Large Language Model (LLM) inference. It introduces interference-aware operator profiles, a contention model for shared resources, a tenant-aware simulation loop, and a framework for evaluating interference mitigation policies.
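To make these ideas concrete, the following is a minimal sketch of how interference-aware operator profiles and a shared-resource contention model might be represented. It is an illustration under assumed names and numbers (`ProfileKey`, `PROFILES`, `op_latency`, the slowdown factors), not the simulator's actual data model or API.

```python
# Hypothetical sketch: operator latency profiles keyed by operator, batch size,
# and a qualitative co-tenant contention level, with an assumed fallback
# slowdown factor when a contention level was not profiled directly.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileKey:
    op: str          # e.g. "attention", "ffn"
    batch: int       # batch size the profile was measured at
    contention: str  # co-tenant load level: "solo", "low", "high"


# Latencies in milliseconds, measured offline under controlled co-location.
# The numbers below are illustrative placeholders, not measured results.
PROFILES = {
    ProfileKey("attention", 8, "solo"): 1.9,
    ProfileKey("attention", 8, "high"): 2.7,
    ProfileKey("ffn", 8, "solo"): 3.1,
    ProfileKey("ffn", 8, "high"): 4.0,
}


def op_latency(op: str, batch: int, contention: str) -> float:
    """Return the profiled latency; fall back to scaling the solo profile
    when the exact contention level was not measured."""
    key = ProfileKey(op, batch, contention)
    if key in PROFILES:
        return PROFILES[key]
    solo = PROFILES[ProfileKey(op, batch, "solo")]
    # Assumed multiplicative contention factors per shared-resource load level.
    factor = {"solo": 1.0, "low": 1.15, "high": 1.4}[contention]
    return solo * factor
```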
Key findings
LLMServingSim 3.0 captures multi-tenant workload interference in LLM serving systems.
Introduces interference-aware operator profiles and multi-level contention modeling.
Includes a tenant-aware simulation loop and a mitigation policy evaluation framework (see the sketch below).
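As a rough illustration of that last point, the sketch below pairs a toy tenant-aware simulation loop with a pluggable scheduling policy. The function names (`simulate`, `fifo`, `per_tenant_round_robin`) and the single-device queueing model are assumptions made for illustration, not the simulator's implementation.

```python
# Hypothetical sketch: serialize requests from multiple tenants on one device
# and compare per-tenant mean latency under different scheduling policies.
from typing import Callable, Dict, List, Tuple

Request = Tuple[float, str, float]  # (arrival time ms, tenant id, service time ms)
Policy = Callable[[List[Request]], List[Request]]


def fifo(queue: List[Request]) -> List[Request]:
    # Arrival-order baseline with no interference mitigation.
    return sorted(queue)


def per_tenant_round_robin(queue: List[Request]) -> List[Request]:
    # Toy mitigation policy: interleave tenants so one heavy tenant cannot
    # monopolize the device and inflate every other tenant's queueing delay.
    by_tenant: Dict[str, List[Request]] = {}
    for req in sorted(queue):
        by_tenant.setdefault(req[1], []).append(req)
    order, lists = [], list(by_tenant.values())
    while any(lists):
        for tenant_queue in lists:
            if tenant_queue:
                order.append(tenant_queue.pop(0))
    return order


def simulate(requests: List[Request], policy: Policy) -> Dict[str, float]:
    """Run all requests back-to-back in policy order; report mean latency per tenant."""
    clock = 0.0
    latency: Dict[str, List[float]] = {}
    for arrival, tenant, service in policy(list(requests)):
        clock = max(clock, arrival) + service
        latency.setdefault(tenant, []).append(clock - arrival)
    return {tenant: sum(vals) / len(vals) for tenant, vals in latency.items()}
```

Swapping `fifo` for `per_tenant_round_robin` in a call to `simulate` is the kind of what-if comparison a mitigation policy evaluation framework would support, here reduced to its simplest possible form.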
Limitations & open questions
Validation is still at the planning stage; the stated target is <5% error for throughput and latency predictions under interference.