This paper introduces TempoAct, a learning-free framework for active perception in multi-turn navigation tasks, addressing the challenge of temporal context accumulation without task-specific training. TempoAct combines information-theoretic exploration principles with foundation model-based scene understanding to enable embodied agents to actively gather, filter, and accumulate perceptual information across multiple decision turns. The framework operates in a zero-shot manner, making it immediately deployable in novel environments.
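To make the information-theoretic exploration idea concrete, below is a minimal sketch of how candidate views might be scored by the Shannon entropy of a scene model's label distribution, with the most uncertain view observed first. The function names, the toy distributions, and the use of raw label entropy as the selection criterion are all illustrative assumptions; the paper does not specify an implementation.

```python
import numpy as np

def shannon_entropy(p: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy in nats of a discrete distribution; eps guards log(0)."""
    p = np.clip(p, eps, 1.0)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def select_view(view_label_probs: list) -> int:
    """Pick the candidate view whose predicted label distribution is most
    uncertain, i.e. where taking a new observation is most informative."""
    entropies = [shannon_entropy(p) for p in view_label_probs]
    return int(np.argmax(entropies))

# Toy usage: label distributions for three candidate views, as they might
# come from a zero-shot scene classifier (values are made up).
views = [
    np.array([0.90, 0.05, 0.05]),  # confident view -> low entropy
    np.array([0.40, 0.35, 0.25]),  # ambiguous view -> high entropy
    np.array([0.70, 0.20, 0.10]),
]
print(select_view(views))  # -> 1, the most ambiguous view
```

Under this reading, high entropy marks the view where a fresh observation removes the most uncertainty, the standard information-gain heuristic for learning-free exploration.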
Key findings
TempoAct couples information-theoretic exploration, which scores candidate views by predictive uncertainty (as sketched above), with foundation model-based scene understanding.
The framework comprises three components: entropy-aware view selection, a temporal context buffer with adaptive forgetting, and multi-turn decision fusion; a sketch of the latter two appears after this list.
TempoAct is expected to achieve competitive success rates without the data requirements or the generalization limitations of learned policies.
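The listed components suggest a simple composition: a buffer that down-weights stale observations and a fusion step that aggregates per-turn action scores by those weights. The sketch below is one hypothetical way to realize the temporal context buffer with adaptive forgetting and the multi-turn decision fusion; every class, constant, and action name is an assumption for illustration, not the paper's design.

```python
from collections import defaultdict

class TemporalContextBuffer:
    """Stores per-turn observations and decays their weight over time.
    The decay rate adapts to confidence: high-confidence observations
    are forgotten more slowly. All constants are illustrative."""

    def __init__(self, base_decay: float = 0.8, floor: float = 0.05):
        self.base_decay = base_decay
        self.floor = floor
        self.entries = []  # each entry: [action_scores, weight, confidence]

    def add(self, action_scores: dict, confidence: float) -> None:
        # A new observation enters with weight equal to its confidence.
        self.entries.append([action_scores, confidence, confidence])

    def step(self) -> None:
        """Apply adaptive forgetting, then drop negligible entries."""
        for entry in self.entries:
            # Confident memories decay more slowly (adaptive forgetting).
            decay = self.base_decay + (1 - self.base_decay) * entry[2]
            entry[1] *= decay
        self.entries = [e for e in self.entries if e[1] >= self.floor]

def fuse_decisions(buffer: TemporalContextBuffer) -> str:
    """Multi-turn fusion: weight each turn's action scores by the buffer's
    decayed weights and pick the best aggregate action."""
    totals = defaultdict(float)
    for scores, weight, _ in buffer.entries:
        for action, s in scores.items():
            totals[action] += weight * s
    return max(totals, key=totals.get)

# Toy usage across three turns of a navigation episode.
buf = TemporalContextBuffer()
buf.add({"forward": 0.7, "left": 0.3}, confidence=0.9)
buf.step()
buf.add({"left": 0.6, "forward": 0.4}, confidence=0.5)
buf.step()
buf.add({"forward": 0.8, "right": 0.2}, confidence=0.8)
print(fuse_decisions(buf))  # aggregate evidence favours "forward"
```

Making the decay rate confidence-dependent is only one plausible reading of "adaptive forgetting"; relevance-based or capacity-based eviction would fit the same interface.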
Limitations & open questions
The paper is a research proposal and does not yet include experimental results.
TempoAct's effectiveness in real-world scenarios remains to be validated.