This research proposes MAVL-JEPA, a novel self-supervised learning framework extending the Joint Embedding Predictive Architecture (JEPA) to tri-modal audio, visual, and language settings. MAVL-JEPA learns joint representations through cross-modal prediction in a latent space, employing modality-specific encoders and a language grounding module to integrate sensory and symbolic information.
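The source does not include an implementation, but a minimal PyTorch sketch of the described components (modality-specific encoders, a cross-modal predictor operating in latent space, and a language grounding module) might look as follows. All module names, input dimensions, and the use of simple MLPs are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MAVLJEPASketch(nn.Module):
    """Illustrative sketch only: modality-specific encoders map audio, visual,
    and text features into a shared latent space; a cross-modal predictor and
    a language grounding module operate on those latents."""

    def __init__(self, dim=512):
        super().__init__()
        # Placeholders for real audio/visual/text backbones (feature dims assumed).
        self.audio_encoder = nn.Sequential(nn.Linear(128, dim), nn.GELU(), nn.Linear(dim, dim))
        self.visual_encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
        self.text_encoder = nn.Sequential(nn.Linear(300, dim), nn.GELU(), nn.Linear(dim, dim))
        # Cross-modal predictor: predicts one modality's latent from the other two.
        self.predictor = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Language grounding module: projects text latents into the audio-visual latent space.
        self.grounding = nn.Linear(dim, dim)

    def forward(self, audio_feats, visual_feats, text_feats):
        z_a = self.audio_encoder(audio_feats)
        z_v = self.visual_encoder(visual_feats)
        z_t = self.grounding(self.text_encoder(text_feats))
        # One possible prediction direction: visual latent from audio + language context.
        z_v_pred = self.predictor(torch.cat([z_a, z_t], dim=-1))
        return z_a, z_v, z_t, z_v_pred
```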
Key findings
MAVL-JEPA extends the JEPA paradigm to tri-modal settings, learning joint representations across audio, visual, and language modalities.
The framework includes cross-modal predictive learning objectives for deep multimodal grounding; an illustrative form of such an objective is sketched after this list of findings.
A language grounding module connects linguistic concepts to audio-visual experiences for zero-shot reasoning.
Comprehensive evaluation protocols are presented for audio-visual event localization, cross-modal retrieval, and multimodal reasoning tasks; a sketch of a retrieval metric follows below.
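The source does not spell out the exact form of the cross-modal predictive objectives. One plausible JEPA-style instantiation regresses a predicted latent onto a stop-gradient target produced by an exponential-moving-average (EMA) target encoder; the smooth-L1 loss and the EMA update below are assumptions borrowed from common JEPA practice, not confirmed details.

```python
import torch
import torch.nn.functional as F

def cross_modal_jepa_loss(predicted_latent, target_latent):
    """JEPA-style latent regression: match the predicted latent to a
    stop-gradient target produced by an EMA target encoder."""
    return F.smooth_l1_loss(predicted_latent, target_latent.detach())

@torch.no_grad()
def ema_update(target_encoder, online_encoder, momentum=0.996):
    """Exponential-moving-average update of the target encoder's parameters."""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

# Hypothetical usage with the sketch model above:
#   z_a, z_v, z_t, z_v_pred = model(audio_feats, visual_feats, text_feats)
#   loss = cross_modal_jepa_loss(z_v_pred, target_visual_encoder(visual_feats))
#   ema_update(target_visual_encoder, model.visual_encoder)
```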
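For the cross-modal retrieval protocol, a standard paired Recall@K metric could be computed over the learned embeddings. The function below assumes query i's ground-truth match is gallery item i, which is an assumption about the evaluation setup rather than a detail given in the source.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k=5):
    """Paired cross-modal Recall@K: assumes query i's ground-truth match
    is gallery item i (standard paired-retrieval evaluation)."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                        # cosine similarity matrix [N, N]
    topk = sims.topk(k, dim=-1).indices     # top-k gallery indices per query
    targets = torch.arange(q.size(0), device=q.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# e.g. audio-to-text retrieval over held-out pairs:
#   r5 = recall_at_k(z_audio, z_text, k=5)
```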
Limitations & open questions
The framework's effectiveness in real-world applications with limited supervision remains to be demonstrated.
The integration of sensory perception with symbolic understanding presents ongoing challenges.