This paper presents OV-ContextNav, an extension of Context-Nav to open-vocabulary object categories using vision-language foundation models. It integrates an open-vocabulary detection module, hierarchical semantic mapping, and an adaptive verification framework, and achieves competitive zero-shot performance on the HM3D-OVON benchmark.
Key findings
OV-ContextNav extends Context-Nav to handle open-vocabulary object categories.
Introduces an open-vocabulary detection module using CLIP-based grounding.
Employs a hierarchical semantic mapping system for diverse object types.
Uses an adaptive verification framework to manage uncertainty in open-vocabulary recognition.
Demonstrates competitive zero-shot performance on the HM3D-OVON benchmark.
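To make the detection and verification findings concrete, the following is a minimal sketch of how CLIP-style open-vocabulary matching and a confidence-based verification step could fit together. It is not the paper's implementation. The function names (`match_region`, `verify`), the temperature, the thresholds, and the three-dimensional toy embeddings are all assumptions made for illustration; a real system would obtain the embeddings from a vision-language model's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_region(region_emb, text_embs, labels, temperature=0.1):
    """Score one image-region embedding against free-form text label embeddings.

    Returns the best label and its softmax probability. The temperature
    value is an assumption, not a number from the paper.
    """
    sims = [cosine(region_emb, t) for t in text_embs]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

def verify(confidence, accept=0.8, reject=0.4):
    """Hypothetical three-way verification: accept, reject, or look again.

    The thresholds are illustrative placeholders.
    """
    if confidence >= accept:
        return "accept"
    if confidence < reject:
        return "reject"
    return "re-observe"

# Toy embeddings standing in for encoder outputs (hypothetical values).
labels = ["chair", "sofa", "plant"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

clear_label, clear_conf = match_region([0.9, 0.1, 0.0], text_embs, labels)
ambiguous_label, ambiguous_conf = match_region([0.7, 0.7, 0.0], text_embs, labels)

print(clear_label, verify(clear_conf))
print(ambiguous_label, verify(ambiguous_conf))
```

In this sketch, a region that clearly resembles one label is accepted, while a region that sits equally close to two labels falls between the thresholds and triggers another observation. That is one simple way to express the idea of managing recognition uncertainty before committing to a navigation goal.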
Limitations & open questions
The system's performance in highly dynamic environments has yet to be evaluated.
Generalization to completely novel object categories requires further testing.