Training multimodal large language models (MLLMs) presents unique challenges because both the models' operators and the underlying data-center hardware are heterogeneous. This paper proposes an adaptive resource scheduling framework that dynamically maps operators to pipeline stages based on real-time profiling of compute intensity, memory pressure, and communication patterns. The framework includes a workload-balancing algorithm that continuously monitors stage execution times and redistributes operators to minimize pipeline bubbles, yielding substantial throughput gains.
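The balancing loop described above (measure per-stage times, then migrate operators off the bottleneck stage) might be sketched as follows. This is an illustrative greedy variant, not the paper's actual algorithm; `Stage` and `rebalance` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    # operator name -> measured execution time (ms), from profiling
    ops: dict[str, float] = field(default_factory=dict)

    def time(self) -> float:
        return sum(self.ops.values())

def rebalance(stages: list[Stage]) -> None:
    """Greedily migrate operators from the slowest stage to the fastest
    while doing so strictly shrinks the pipeline bottleneck (max stage time)."""
    while True:
        src = max(stages, key=Stage.time)
        dst = min(stages, key=Stage.time)
        if not src.ops:
            return
        # Move the cheapest operator on the slow stage: the smallest
        # risk of overshooting the balance point.
        name, cost = min(src.ops.items(), key=lambda kv: kv[1])
        # Stop once no migration reduces the bottleneck.
        if dst.time() + cost >= src.time():
            return
        del src.ops[name]
        dst.ops[name] = cost

stages = [Stage({"a": 10.0, "b": 2.0, "c": 2.0}), Stage({"d": 4.0})]
rebalance(stages)  # bottleneck drops from 14 ms to 10 ms
```

Each migration strictly lowers the bottleneck stage's time, so the loop terminates; in practice a real scheduler would also weigh migration cost and re-profile after each move.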
Key findings
Adaptive resource scheduling framework proposed for efficient MLLM training in heterogeneous data centers.
Dynamic Operator Mapping algorithm assigns operators to pipeline stages based on real-time profiling.
Workload Balancing mechanism detects load imbalance and migrates operators to minimize pipeline bubbles.
Heterogeneity Awareness integrates GPU capability profiles into scheduling decisions.
Extensive experiments show up to 149.6% throughput improvement over Megatron-LM baselines.
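To illustrate the heterogeneity-awareness idea, a minimal sketch of capability-weighted stage assignment is shown below: each operator's measured cost is weighted by the hosting GPU's relative capability, so faster GPUs absorb proportionally more work. The function name and the proportional-split heuristic are assumptions for illustration, not the paper's scheduler.

```python
def heterogeneous_split(op_costs: list[float], gpu_speeds: list[float]) -> list[list[int]]:
    """Assign operators (kept in pipeline order) to GPUs so that each
    GPU's *normalized* load (cost / speed) is roughly equal.
    op_costs:   per-operator measured costs, in pipeline order.
    gpu_speeds: relative capability of each GPU (e.g., A100=1.0, V100=0.5).
    """
    total = sum(op_costs)
    speed_sum = sum(gpu_speeds)
    assignment: list[list[int]] = [[] for _ in gpu_speeds]
    gpu, load = 0, 0.0
    for i, cost in enumerate(op_costs):
        # Target share of total cost for this GPU, proportional to its speed.
        target = total * gpu_speeds[gpu] / speed_sum
        # Advance to the next GPU once this one's budget would be exceeded.
        if load + cost > target and load > 0 and gpu < len(gpu_speeds) - 1:
            gpu, load = gpu + 1, 0.0
        assignment[gpu].append(i)
        load += cost

    return assignment

# A GPU twice as fast receives twice the raw work (8 vs. 4 cost units),
# giving equal normalized loads of 4 on each device.
plan = heterogeneous_split([4.0, 4.0, 2.0, 2.0], [2.0, 1.0])  # -> [[0, 1], [2, 3]]
```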
Limitations & open questions
Further research is needed on the scheduler's scalability and on further optimization for larger clusters.