The paper proposes Hierarchical JEPA, a novel architecture that combines Joint-Embedding Predictive Architectures (JEPA) with hierarchical temporal abstraction to let robots learn action sequences from video demonstrations. It addresses challenges such as the semantic gap between raw video and actions, long-horizon temporal reasoning, and efficient knowledge transfer.
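For context, a minimal sketch of the core JEPA idea the paper builds on: predict the embedding of a future observation rather than its pixels. All module names, dimensions, and the stop-gradient target below are illustrative assumptions, not the paper's implementation (PyTorch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JEPAStep(nn.Module):
    """Single-step JEPA-style objective: predict the next frame's embedding.

    Hypothetical minimal sketch; real JEPA variants typically use a separate
    EMA (momentum) target encoder rather than the plain stop-gradient here.
    """

    def __init__(self, dim=256, frame_pixels=3 * 64 * 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(frame_pixels, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, frame_t, frame_next):
        z_t = self.encoder(frame_t.flatten(1))        # context embedding
        with torch.no_grad():                         # no gradient through the target
            z_next = self.encoder(frame_next.flatten(1))
        z_pred = self.predictor(z_t)
        # The loss lives in embedding space, not pixel space, which is what
        # lets the representation discard pixel-level nuisance detail.
        return F.mse_loss(z_pred, z_next)

# Usage with dummy 64x64 RGB frames:
f0, f1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
print(JEPAStep()(f0, f1).item())
```

With a bare MSE in embedding space the encoder can collapse to a constant output; published JEPA systems prevent this with EMA target encoders or variance regularization, omitted here for brevity.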
Key findings
Hierarchical JEPA introduces a multi-level predictive architecture for learning action sequences from video demonstrations; a minimal sketch of such a multi-level structure follows this list.
The model captures compositional task structure and generalizes to novel task combinations.
It learns representations that are invariant to visual distractors while preserving task-relevant semantic information.
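The multi-level structure in the first finding could look roughly like the following: a stack of predictors operating at progressively coarser timescales over per-frame embeddings (such as those produced by the encoder sketched earlier). The strides, GRU predictors, and summed loss are assumptions for illustration; the paper's exact design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPredictor(nn.Module):
    """Hypothetical multi-level temporal predictor over frame embeddings.

    Each level subsamples the embedding sequence at a coarser stride, so
    higher levels model long-horizon structure (e.g. subtask transitions)
    while the lowest level models fine-grained frame-to-frame dynamics.
    """

    def __init__(self, dim=256, strides=(1, 4, 16)):
        super().__init__()
        self.strides = strides
        self.levels = nn.ModuleList(
            nn.GRU(dim, dim, batch_first=True) for _ in strides
        )

    def forward(self, z):
        # z: (batch, time, dim) embeddings from a shared frame encoder.
        loss = z.new_zeros(())
        for stride, rnn in zip(self.strides, self.levels):
            z_s = z[:, ::stride]          # subsample to this level's timescale
            if z_s.size(1) < 2:
                continue                  # too few steps at this stride
            pred, _ = rnn(z_s[:, :-1])    # predict the next embedding
            loss = loss + F.mse_loss(pred, z_s[:, 1:].detach())
        return loss

# Example: two clips of 64 frames, each frame a 256-d embedding.
z = torch.randn(2, 64, 256)
print(HierarchicalPredictor()(z).item())
```

Summing the losses across strides trains all levels jointly; a full system would likely add top-down conditioning from coarse to fine levels, which this sketch omits.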
Limitations & open questions
The paper does not discuss how the proposed architecture scales to very long-horizon or highly complex tasks.