This research proposes MAVL-JEPA, a novel self-supervised learning framework extending the Joint Embedding Predictive Architecture (JEPA) to tri-modal audio, visual, and language settings. MAVL-JEPA learns joint representations through cross-modal prediction in a latent space, employing modality-specific encoders and a language grounding module to integrate sensory and symbolic information.
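The source does not include an implementation, but a minimal PyTorch sketch of the described components (modality-specific encoders, a cross-modal predictor operating in latent space, and a language grounding module) might look as follows. All module names, input dimensions, and the use of simple MLPs are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MAVLJEPASketch(nn.Module):
    """Illustrative sketch only: modality-specific encoders map audio, visual,
    and text features into a shared latent space; a cross-modal predictor and
    a language grounding module operate on those latents."""

    def __init__(self, dim=512):
        super().__init__()
        # Placeholders for real audio/visual/text backbones (feature dims assumed).
        self.audio_encoder = nn.Sequential(nn.Linear(128, dim), nn.GELU(), nn.Linear(dim, dim))
        self.visual_encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
        self.text_encoder = nn.Sequential(nn.Linear(300, dim), nn.GELU(), nn.Linear(dim, dim))
        # Cross-modal predictor: predicts one modality's latent from the other two.
        self.predictor = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Language grounding module: projects text latents into the audio-visual latent space.
        self.grounding = nn.Linear(dim, dim)

    def forward(self, audio_feats, visual_feats, text_feats):
        z_a = self.audio_encoder(audio_feats)
        z_v = self.visual_encoder(visual_feats)
        z_t = self.grounding(self.text_encoder(text_feats))
        # One possible prediction direction: visual latent from audio + language context.
        z_v_pred = self.predictor(torch.cat([z_a, z_t], dim=-1))
        return z_a, z_v, z_t, z_v_pred
```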
Key findings
MAVL-JEPA extends the JEPA paradigm to tri-modal settings, learning joint representations across audio, visual, and language modalities.
The framework includes cross-modal predictive learning objectives for deep multimodal grounding; an illustrative form of such an objective is sketched after this list of findings.
A language grounding module connects linguistic concepts to audio-visual experiences for zero-shot reasoning.
Comprehensive evaluation protocols are presented for audio-visual event localization, cross-modal retrieval, and multimodal reasoning tasks; a sketch of a retrieval metric follows below.
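The source does not spell out the exact form of the cross-modal predictive objectives. One plausible JEPA-style instantiation regresses a predicted latent onto a stop-gradient target produced by an exponential-moving-average (EMA) target encoder; the smooth-L1 loss and the EMA update below are assumptions borrowed from common JEPA practice, not confirmed details.

```python
import torch
import torch.nn.functional as F

def cross_modal_jepa_loss(predicted_latent, target_latent):
    """JEPA-style latent regression: match the predicted latent to a
    stop-gradient target produced by an EMA target encoder."""
    return F.smooth_l1_loss(predicted_latent, target_latent.detach())

@torch.no_grad()
def ema_update(target_encoder, online_encoder, momentum=0.996):
    """Exponential-moving-average update of the target encoder's parameters."""
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

# Hypothetical usage with the sketch model above:
#   z_a, z_v, z_t, z_v_pred = model(audio_feats, visual_feats, text_feats)
#   loss = cross_modal_jepa_loss(z_v_pred, target_visual_encoder(visual_feats))
#   ema_update(target_visual_encoder, model.visual_encoder)
```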
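For the cross-modal retrieval protocol, a standard paired Recall@K metric could be computed over the learned embeddings. The function below assumes query i's ground-truth match is gallery item i, which is an assumption about the evaluation setup rather than a detail given in the source.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, k=5):
    """Paired cross-modal Recall@K: assumes query i's ground-truth match
    is gallery item i (standard paired-retrieval evaluation)."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                        # cosine similarity matrix [N, N]
    topk = sims.topk(k, dim=-1).indices     # top-k gallery indices per query
    targets = torch.arange(q.size(0), device=q.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()

# e.g. audio-to-text retrieval over held-out pairs:
#   r5 = recall_at_k(z_audio, z_text, k=5)
```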
Limitations & open questions
The framework's effectiveness in real-world applications with limited supervision remains to be demonstrated.
The integration of sensory perception with symbolic understanding presents ongoing challenges.