This paper proposes CAST-MSP, a novel architecture that extends the CAST framework to video generation. It introduces a dual-stream spatial-temporal generator, a multi-step state prediction module, and a progressive training curriculum, and it achieves stronger temporal consistency in compositional video generation than autoregressive baselines.
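The summary above carries no code; purely as illustration, a dual-stream spatial-temporal generator could be factored as in the following PyTorch sketch. The class name `DualStreamGenerator`, the layer choices, and the tensor layout are assumptions made here for clarity, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualStreamGenerator(nn.Module):
    """Hypothetical dual-stream block (not the paper's code): a spatial
    stream models each frame independently, a temporal stream mixes
    information across frames, and a 1x1x1 conv fuses the two."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Spatial stream: per-frame 2D convolutions.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # Temporal stream: 3D convolution along the frame axis only.
        self.temporal = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.GELU(),
        )
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Fold frames into the batch so 2D convs see one frame at a time.
        s = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        s = s.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # Temporal stream operates on the full clip.
        v = self.temporal(x)
        return self.fuse(torch.cat([s, v], dim=1))
```

A full generator would stack several such blocks with up/down-sampling; the sketch shows only the spatial/temporal factorization the dual-stream design implies.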
Key findings
CAST-MSP extends the CAST framework to video generation with a dual-stream spatial-temporal generator.
A multi-step state prediction module forecasts several future states jointly rather than one step at a time, limiting the error accumulation of autoregressive rollout (see the sketch after this list).
A cycle-consistency training objective enforces temporal consistency without requiring a teacher model (also illustrated in the sketch below).
CAST-MSP achieves state-of-the-art results on long-horizon video composition tasks.
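As a concrete illustration of the two mechanisms above, here is a minimal PyTorch sketch in the same hypothetical style as the generator sketch: `MultiStepPredictor` emits K future latent states in one pass, and `cycle_consistency_loss` maps each prediction back to the present and penalizes the distance to the original state. All names, shapes, and the specific form of the cycle term are assumptions, not the authors' API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStepPredictor(nn.Module):
    """Hypothetical predictor: emits K future latent states in a single
    forward pass instead of rolling out one step at a time, so one-step
    errors are never fed back in as inputs."""

    def __init__(self, dim: int = 256, horizon: int = 4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # One output head per future step.
        self.heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(horizon))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        # (batch, horizon, dim): all K predicted states at once.
        return torch.stack([head(h) for head in self.heads], dim=1)

def cycle_consistency_loss(forward_model, backward_model, state):
    """Hypothetical cycle term: map each predicted future state back to
    the present and penalize the distance to the original state. The
    model supervises itself; no teacher network is required."""
    future = forward_model(state)                 # (B, K, D)
    back = backward_model(future.flatten(0, 1))   # (B*K, D)
    back = back.reshape(future.shape)             # (B, K, D)
    return F.mse_loss(back, state.unsqueeze(1).expand_as(back))

# Usage (shapes are illustrative):
predictor = MultiStepPredictor(dim=256, horizon=4)
backward_net = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
state = torch.randn(8, 256)
loss = cycle_consistency_loss(predictor, backward_net, state)
loss.backward()
```

In this reading, joint prediction avoids compounding one-step errors, while the cycle term ties every predicted state back to the observed one, which is one plausible way to get temporal consistency without a teacher model.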
Limitations & open questions
Further research is needed to improve generalization to even longer video sequences.