The paper proposes Hierarchical JEPA, a novel architecture that combines Joint-Embedding Predictive Architectures (JEPA) with hierarchical temporal abstraction to let robots learn action sequences from video demonstrations. It addresses challenges such as the semantic gap between raw video and actions, long-horizon temporal reasoning, and efficient knowledge transfer.
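For context, a minimal sketch of the core JEPA idea the paper builds on: predict the embedding of a future observation rather than its pixels. All module names, dimensions, and the stop-gradient target below are illustrative assumptions, not the paper's implementation (PyTorch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JEPAStep(nn.Module):
    """Single-step JEPA-style objective: predict the next frame's embedding.

    Hypothetical minimal sketch; real JEPA variants typically use a separate
    EMA (momentum) target encoder rather than the plain stop-gradient here.
    """

    def __init__(self, dim=256, frame_pixels=3 * 64 * 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(frame_pixels, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.predictor = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, frame_t, frame_next):
        z_t = self.encoder(frame_t.flatten(1))        # context embedding
        with torch.no_grad():                         # no gradient through the target
            z_next = self.encoder(frame_next.flatten(1))
        z_pred = self.predictor(z_t)
        # The loss lives in embedding space, not pixel space, which is what
        # lets the representation discard pixel-level nuisance detail.
        return F.mse_loss(z_pred, z_next)

# Usage with dummy 64x64 RGB frames:
f0, f1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
print(JEPAStep()(f0, f1).item())
```

With a bare MSE in embedding space the encoder can collapse to a constant output; published JEPA systems prevent this with EMA target encoders or variance regularization, omitted here for brevity.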
Key findings
Hierarchical JEPA introduces a multi-level predictive architecture for learning action sequences from video demonstrations; a minimal sketch of such a multi-level structure follows this list.
The model captures compositional task structure and generalizes to novel task combinations.
It learns representations that are invariant to visual distractors while preserving task-relevant semantic information.
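The multi-level structure in the first finding could look roughly like the following: a stack of predictors operating at progressively coarser timescales over per-frame embeddings (such as those produced by the encoder sketched earlier). The strides, GRU predictors, and summed loss are assumptions for illustration; the paper's exact design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPredictor(nn.Module):
    """Hypothetical multi-level temporal predictor over frame embeddings.

    Each level subsamples the embedding sequence at a coarser stride, so
    higher levels model long-horizon structure (e.g. subtask transitions)
    while the lowest level models fine-grained frame-to-frame dynamics.
    """

    def __init__(self, dim=256, strides=(1, 4, 16)):
        super().__init__()
        self.strides = strides
        self.levels = nn.ModuleList(
            nn.GRU(dim, dim, batch_first=True) for _ in strides
        )

    def forward(self, z):
        # z: (batch, time, dim) embeddings from a shared frame encoder.
        loss = z.new_zeros(())
        for stride, rnn in zip(self.strides, self.levels):
            z_s = z[:, ::stride]          # subsample to this level's timescale
            if z_s.size(1) < 2:
                continue                  # too few steps at this stride
            pred, _ = rnn(z_s[:, :-1])    # predict the next embedding
            loss = loss + F.mse_loss(pred, z_s[:, 1:].detach())
        return loss

# Example: two clips of 64 frames, each frame a 256-d embedding.
z = torch.randn(2, 64, 256)
print(HierarchicalPredictor()(z).item())
```

Summing the losses across strides trains all levels jointly; a full system would likely add top-down conditioning from coarse to fine levels, which this sketch omits.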
Limitations & open questions
The paper does not discuss how the proposed architecture scales to very long-horizon or highly complex tasks.