ABSTRACT
LLMServingSim 3.0 extends profile-based performance modeling to capture multi-tenant workload interference in GPU clusters that serve Large Language Model (LLM) inference. It introduces interference-aware operator profiles, a contention model for shared resources, a tenant-aware simulation loop, and a framework for evaluating interference mitigation policies.
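To make these ideas concrete, the following is a minimal sketch of how interference-aware operator profiles and a shared-resource contention model might be represented. It is an illustration under assumed names and numbers (`ProfileKey`, `PROFILES`, `op_latency`, the slowdown factors), not the simulator's actual data model or API.

```python
# Hypothetical sketch: operator latency profiles keyed by operator, batch size,
# and a qualitative co-tenant contention level, with an assumed fallback
# slowdown factor when a contention level was not profiled directly.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileKey:
    op: str          # e.g. "attention", "ffn"
    batch: int       # batch size the profile was measured at
    contention: str  # co-tenant load level: "solo", "low", "high"


# Latencies in milliseconds, measured offline under controlled co-location.
# The numbers below are illustrative placeholders, not measured results.
PROFILES = {
    ProfileKey("attention", 8, "solo"): 1.9,
    ProfileKey("attention", 8, "high"): 2.7,
    ProfileKey("ffn", 8, "solo"): 3.1,
    ProfileKey("ffn", 8, "high"): 4.0,
}


def op_latency(op: str, batch: int, contention: str) -> float:
    """Return the profiled latency; fall back to scaling the solo profile
    when the exact contention level was not measured."""
    key = ProfileKey(op, batch, contention)
    if key in PROFILES:
        return PROFILES[key]
    solo = PROFILES[ProfileKey(op, batch, "solo")]
    # Assumed multiplicative contention factors per shared-resource load level.
    factor = {"solo": 1.0, "low": 1.15, "high": 1.4}[contention]
    return solo * factor
```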
Key findings
LLMServingSim 3.0 captures multi-tenant workload interference in LLM serving systems.
Introduces interference-aware operator profiles and multi-level contention modeling.
Includes a tenant-aware simulation loop and a mitigation policy evaluation framework (see the sketch below).
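As a rough illustration of that last point, the sketch below pairs a toy tenant-aware simulation loop with a pluggable scheduling policy. The function names (`simulate`, `fifo`, `per_tenant_round_robin`) and the single-device queueing model are assumptions made for illustration, not the simulator's implementation.

```python
# Hypothetical sketch: serialize requests from multiple tenants on one device
# and compare per-tenant mean latency under different scheduling policies.
from typing import Callable, Dict, List, Tuple

Request = Tuple[float, str, float]  # (arrival time ms, tenant id, service time ms)
Policy = Callable[[List[Request]], List[Request]]


def fifo(queue: List[Request]) -> List[Request]:
    # Arrival-order baseline with no interference mitigation.
    return sorted(queue)


def per_tenant_round_robin(queue: List[Request]) -> List[Request]:
    # Toy mitigation policy: interleave tenants so one heavy tenant cannot
    # monopolize the device and inflate every other tenant's queueing delay.
    by_tenant: Dict[str, List[Request]] = {}
    for req in sorted(queue):
        by_tenant.setdefault(req[1], []).append(req)
    order, lists = [], list(by_tenant.values())
    while any(lists):
        for tenant_queue in lists:
            if tenant_queue:
                order.append(tenant_queue.pop(0))
    return order


def simulate(requests: List[Request], policy: Policy) -> Dict[str, float]:
    """Run all requests back-to-back in policy order; report mean latency per tenant."""
    clock = 0.0
    latency: Dict[str, List[float]] = {}
    for arrival, tenant, service in policy(list(requests)):
        clock = max(clock, arrival) + service
        latency.setdefault(tenant, []).append(clock - arrival)
    return {tenant: sum(vals) / len(vals) for tenant, vals in latency.items()}
```

Swapping `fifo` for `per_tenant_round_robin` in a call to `simulate` is the kind of what-if comparison a mitigation policy evaluation framework would support, here reduced to its simplest possible form.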
Limitations & open questions
Validation is still at the planning stage; the stated target is <5% error for throughput and latency predictions under interference.