This paper presents TemporalViper, a framework that extends code-as-perception to video-based STEM reasoning through temporal code generation. It introduces a temporal program representation, a modular video understanding architecture, and an adaptive temporal memory mechanism, achieving state-of-the-art performance on compositional spatio-temporal reasoning tasks while preserving the interpretability and compositional-generalization benefits of the code-as-perception paradigm.
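To make the temporal-program idea concrete, the sketch below shows the kind of executable program such a framework might generate for a simple physics question ("does the ball reach its peak before the cart crosses the line?"). Everything in it is an illustrative assumption: the toy frame data and the operator names (first_frame_where, before, is_peak) are invented for this sketch, not taken from the paper.

```python
# Illustrative only: the toy frame data and operator names
# (first_frame_where, before, is_peak) are assumptions, not the paper's API.
from typing import Callable

# Toy "video": per-frame object states from a projectile-and-cart demo.
FRAMES = [
    {"ball_y": 0.0, "cart_x": 0.0},
    {"ball_y": 4.0, "cart_x": 1.0},
    {"ball_y": 6.0, "cart_x": 2.0},  # ball reaches peak height here
    {"ball_y": 4.0, "cart_x": 3.5},  # cart crosses x >= 3.0 here
    {"ball_y": 0.0, "cart_x": 5.0},
]

def first_frame_where(pred: Callable[[int], bool]) -> int | None:
    """Temporal operator: earliest frame index satisfying pred, else None."""
    return next((t for t in range(len(FRAMES)) if pred(t)), None)

def before(t1: int | None, t2: int | None) -> bool:
    """Temporal operator: the first event strictly precedes the second."""
    return t1 is not None and t2 is not None and t1 < t2

def is_peak(t: int) -> bool:
    """Stateful predicate: the ball's height stops increasing at frame t."""
    ys = [f["ball_y"] for f in FRAMES]
    return 0 < t < len(ys) - 1 and ys[t - 1] < ys[t] >= ys[t + 1]

peak_t = first_frame_where(is_peak)                              # -> 2
cross_t = first_frame_where(lambda t: FRAMES[t]["cart_x"] >= 3)  # -> 3
print(before(peak_t, cross_t))  # True: the ball peaks before the cart crosses
```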
Key findings
TemporalViper extends code-as-perception to video-based STEM reasoning by generating temporal programs rather than single-image programs.
The framework introduces explicit temporal operators and stateful variable tracking; the sketch after this list illustrates how these primitives might combine with the temporal memory.
It integrates spatio-temporal perception modules with domain-specific scientific reasoning operators.
TemporalViper achieves state-of-the-art performance on compositional spatio-temporal reasoning tasks.
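As a companion to the bullets above, here is a minimal, hypothetical sketch of how stateful variable tracking, a temporal memory, and a domain-specific reasoning operator could compose. The names (TemporalMemory, update, series, average_velocity) and the bounded-eviction policy are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch: class/function names and the eviction policy are
# assumptions in the spirit of the paper's description, not its real API.
from collections import defaultdict

class TemporalMemory:
    """Toy 'adaptive' temporal memory: keeps a bounded per-variable history,
    evicting the oldest values once the budget is exceeded."""

    def __init__(self, budget: int = 64):
        self.budget = budget
        self.history: dict[str, list[tuple[int, float]]] = defaultdict(list)

    def update(self, t: int, name: str, value: float) -> None:
        """Stateful variable tracking: record (frame, value) for a variable."""
        trace = self.history[name]
        trace.append((t, value))
        if len(trace) > self.budget:  # bounded history keeps memory adaptive
            del trace[0]

    def series(self, name: str) -> list[tuple[int, float]]:
        return self.history[name]

def average_velocity(series: list[tuple[int, float]], fps: float) -> float:
    """Domain-specific reasoning operator: mean velocity from a position trace."""
    (t0, x0), (t1, x1) = series[0], series[-1]
    return (x1 - x0) / ((t1 - t0) / fps)

# Perception stub: per-frame cart positions (meters), as a spatio-temporal
# perception module might emit them.
mem = TemporalMemory(budget=8)
for t, x in enumerate([0.0, 0.4, 0.9, 1.5, 2.2]):
    mem.update(t, "cart_x", x)

print(average_velocity(mem.series("cart_x"), fps=2.0))  # -> 1.1 (m/s)
```

A bounded history is one plausible reading of "adaptive": it caps memory use regardless of video length, which also bears on the open scalability question noted below.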
Limitations & open questions
The paper does not discuss the computational complexity of TemporalViper.
The scalability of the framework to longer video sequences is not addressed.