This paper proposes CAST-MSP, a novel architecture that extends the CAST framework to video generation. It introduces a dual-stream spatial-temporal generator, a multi-step state prediction module, and a progressive training curriculum, and it achieves stronger temporal consistency in compositional video generation than autoregressive baselines.
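The summary above carries no code; purely as illustration, a dual-stream spatial-temporal generator could be factored as in the following PyTorch sketch. The class name `DualStreamGenerator`, the layer choices, and the tensor layout are assumptions made here for clarity, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualStreamGenerator(nn.Module):
    """Hypothetical dual-stream block (not the paper's code): a spatial
    stream models each frame independently, a temporal stream mixes
    information across frames, and a 1x1x1 conv fuses the two."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Spatial stream: per-frame 2D convolutions.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # Temporal stream: 3D convolution along the frame axis only.
        self.temporal = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.GELU(),
        )
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Fold frames into the batch so 2D convs see one frame at a time.
        s = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        s = s.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # Temporal stream operates on the full clip.
        v = self.temporal(x)
        return self.fuse(torch.cat([s, v], dim=1))
```

A full generator would stack several such blocks with up/down-sampling; the sketch shows only the spatial/temporal factorization the dual-stream design implies.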
Key findings
CAST-MSP extends the CAST framework to video generation with a dual-stream spatial-temporal generator.
A multi-step state prediction module forecasts several future states jointly rather than one step at a time, limiting the error accumulation of autoregressive rollout (see the sketch after this list).
A cycle-consistency training objective enforces temporal consistency without requiring a teacher model (also illustrated in the sketch below).
CAST-MSP achieves state-of-the-art results on long-horizon video composition tasks.
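As a concrete illustration of the two mechanisms above, here is a minimal PyTorch sketch in the same hypothetical style as the generator sketch: `MultiStepPredictor` emits K future latent states in one pass, and `cycle_consistency_loss` maps each prediction back to the present and penalizes the distance to the original state. All names, shapes, and the specific form of the cycle term are assumptions, not the authors' API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStepPredictor(nn.Module):
    """Hypothetical predictor: emits K future latent states in a single
    forward pass instead of rolling out one step at a time, so one-step
    errors are never fed back in as inputs."""

    def __init__(self, dim: int = 256, horizon: int = 4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # One output head per future step.
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(horizon))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        # (batch, horizon, dim): all K predicted states at once.
        return torch.stack([head(h) for head in self.heads], dim=1)

def cycle_consistency_loss(forward_model, backward_model, state):
    """Hypothetical cycle term: map each predicted future state back to
    the present and penalize the distance to the original state. The
    model supervises itself; no teacher network is required."""
    future = forward_model(state)                 # (B, K, D)
    back = backward_model(future.flatten(0, 1))   # (B*K, D)
    back = back.reshape(future.shape)             # (B, K, D)
    return F.mse_loss(back, state.unsqueeze(1).expand_as(back))

# Usage (shapes are illustrative):
predictor = MultiStepPredictor(dim=256, horizon=4)
backward_net = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
state = torch.randn(8, 256)
loss = cycle_consistency_loss(predictor, backward_net, state)
loss.backward()
```

In this reading, joint prediction avoids compounding one-step errors, while the cycle term ties every predicted state back to the observed one, which is one plausible way to get temporal consistency without a teacher model.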
Limitations & open questions
Further research is needed to improve generalization to even longer video sequences.