This research proposes Agent-Amodal, a framework that enables unsupervised amodal completion without paired annotations by framing the task as active perception. An embodied agent learns to explore visual scenes and accumulate multi-view evidence through curiosity-driven viewpoint selection. The approach uses a temporally consistent completion network trained with self-supervised geometric-consistency losses and occlusion-aware contrastive learning. The method addresses the limitations of supervised and synthetic-data approaches while laying the groundwork for autonomous amodal perception systems.
Key findings
Curiosity-driven exploration module strategically selects viewpoints to disambiguate occlusions, using reinforcement learning with information-gain rewards (see the reward sketch after this list)
Temporal-aggregation completion network maintains and updates shape hypotheses while enforcing geometric consistency across multiple observations (see the aggregation sketch below)
Self-supervised objectives, including reconstruction losses and occlusion-aware contrastive learning, enable training without ground-truth amodal masks (see the contrastive-loss sketch below)
Framework designed to handle novel categories and embodied-AI scenarios without expensive human annotations or the domain gaps of synthetic data
Comprehensive validation planned on KINS, COCOA, and D2SA benchmarks with ablation studies to isolate component contributions
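To make the information-gain reward concrete, here is a minimal sketch of one plausible formulation: the agent keeps a per-pixel Bernoulli belief over the amodal mask, and each new viewpoint is rewarded by how much it reduces that belief's entropy. The function names and the entropy-reduction form are illustrative assumptions, not details taken from the proposal.

```python
import numpy as np

def belief_entropy(belief: np.ndarray) -> float:
    # Mean per-pixel Bernoulli entropy of an amodal-mask belief map in [0, 1].
    eps = 1e-8
    p = np.clip(belief, eps, 1.0 - eps)
    return float((-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))).mean())

def information_gain_reward(belief_before: np.ndarray,
                            belief_after: np.ndarray) -> float:
    # Reward = uncertainty about the occluded shape removed by the new view.
    return belief_entropy(belief_before) - belief_entropy(belief_after)
```

Under this formulation, a viewpoint that reveals previously occluded regions pushes pixel beliefs toward 0 or 1 and yields a positive reward, which a standard RL policy (e.g., PPO) can maximize.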
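For the temporal-aggregation network, the proposal does not specify a fusion rule; the sketch below assumes one simple possibility, an exponential moving average over mask logits, plus a geometric-consistency term that compares one view's prediction against another view's prediction warped into the same frame. `TemporalMaskAggregator` and `geometric_consistency_loss` are hypothetical names.

```python
import torch

class TemporalMaskAggregator:
    """Running amodal-shape hypothesis fused across views (illustrative
    logit-space EMA; the actual fusion rule is unspecified)."""

    def __init__(self, shape: tuple, momentum: float = 0.8):
        self.logits = torch.zeros(shape)
        self.momentum = momentum

    def update(self, view_logits: torch.Tensor) -> torch.Tensor:
        # Blend the new single-view prediction into the running hypothesis.
        self.logits = (self.momentum * self.logits
                       + (1.0 - self.momentum) * view_logits)
        return torch.sigmoid(self.logits)

def geometric_consistency_loss(pred_a: torch.Tensor,
                               pred_b_warped: torch.Tensor,
                               valid: torch.Tensor) -> torch.Tensor:
    # Penalize disagreement between view A's mask and view B's mask warped
    # into A's frame, averaged over pixels that are valid in both views.
    diff = (pred_a - pred_b_warped).abs()
    return (valid * diff).sum() / valid.sum().clamp(min=1.0)
```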
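Similarly, "occlusion-aware contrastive learning" could take many forms; one plausible reading is an InfoNCE loss over object-region embeddings, where matching regions across views are positives and fully occluded anchors are excluded because they carry no visible evidence. Everything below, including the masking choice, is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def occlusion_aware_info_nce(anchor: torch.Tensor,     # (B, D) region embeddings
                             positive: torch.Tensor,   # (B, D) same regions, other view
                             negatives: torch.Tensor,  # (K, D) other-object regions
                             occluded: torch.Tensor,   # (B,) bool: anchor fully occluded
                             temperature: float = 0.1) -> torch.Tensor:
    # Standard InfoNCE with one occlusion-aware twist: pairs whose anchor
    # region is fully occluded are dropped from the loss.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos = (anchor * positive).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg = anchor @ negatives.T / temperature                           # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive at index 0

    per_pair = F.cross_entropy(logits, labels, reduction="none")
    keep = (~occluded).float()
    return (per_pair * keep).sum() / keep.sum().clamp(min=1.0)
```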
Limitations & open questions
Method is currently a research proposal; no empirical validation or implemented system is reported
Curiosity-driven exploration may face challenges in highly cluttered scenes with severe occlusions
Self-supervised learning objectives may produce ambiguous completions for objects with high shape variability