This paper proposes HiMemVLN, a novel architecture for Vision-Language Navigation (VLN) that introduces a hierarchically organized Streaming Visual Memory for continuous environment adaptation. The method combines a Multi-Resolution Memory Bank, a Dynamic Attention Routing mechanism, and an Episodic Consolidation process, and it achieves state-of-the-art performance on the VLN-CE, R2R, and REVERIE benchmarks.
Key findings
HiMemVLN introduces a hierarchical streaming visual memory architecture for continuous environment adaptation in VLN.
The architecture includes a Multi-Resolution Memory Bank, Dynamic Attention Routing, and Episodic Consolidation.
HiMemVLN achieves state-of-the-art performance, improving success rate on unseen environments by 4.2% while reducing memory footprint by 35%.
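To make the idea of a hierarchical streaming memory concrete, here is a minimal toy sketch. It is not the paper's implementation: the summary does not specify how the Multi-Resolution Memory Bank, Dynamic Attention Routing, or Episodic Consolidation actually work, so the class name `ToyStreamingMemory`, the parameters `num_levels`, `capacity`, and `merge`, the mean-pooling consolidation rule, and the softmax read are all assumptions chosen for illustration. The sketch only shows why a multi-level memory with consolidation can keep its footprint bounded while a stream of observations keeps arriving.

```python
import math
from collections import deque


class ToyStreamingMemory:
    """Hypothetical multi-level memory: level 0 holds recent features,
    higher levels hold progressively coarser averages of older ones."""

    def __init__(self, num_levels=3, capacity=4, merge=2):
        self.levels = [deque() for _ in range(num_levels)]
        self.capacity = capacity  # max entries kept per level
        self.merge = merge        # how many old entries are pooled into one

    def write(self, feature, level=0):
        """Append a feature; on overflow, consolidate the oldest entries."""
        self.levels[level].append(list(feature))
        if len(self.levels[level]) > self.capacity:
            oldest = [self.levels[level].popleft() for _ in range(self.merge)]
            if level + 1 < len(self.levels):
                # Assumed consolidation rule: mean-pool into the coarser level.
                pooled = [sum(vals) / self.merge for vals in zip(*oldest)]
                self.write(pooled, level + 1)
            # At the coarsest level the oldest entries are simply dropped.

    def read(self, query):
        """Assumed routing rule: softmax dot-product attention over all levels."""
        entries = [e for lvl in self.levels for e in lvl]
        if not entries:
            return [0.0] * len(query)
        scores = [sum(q * x for q, x in zip(query, e)) for e in entries]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [
            sum(w * e[i] for w, e in zip(weights, entries)) / total
            for i in range(len(query))
        ]

    def size(self):
        return sum(len(lvl) for lvl in self.levels)


# Stream 100 two-dimensional features; the memory never exceeds
# num_levels * capacity entries no matter how long the stream runs.
memory = ToyStreamingMemory(num_levels=3, capacity=4, merge=2)
for step in range(100):
    memory.write([float(step), 1.0])
context = memory.read([0.01, 0.0])
```

In this sketch the stored entry count is capped at `num_levels * capacity`, in contrast to a flat buffer that grows with every observation. How the paper obtains its reported memory reduction is not described in this summary.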
Limitations & open questions
The paper does not discuss in depth whether HiMemVLN extends to navigation tasks beyond VLN.