This paper presents DexRep-Vis, a novel representation that extends geometric hand-object representations with visual texture features to enable robust manipulation of transparent and reflective objects. The method fuses geometric point cloud features from partial depth observations with RGB-based texture and material features extracted by a dedicated visual encoder, adaptively weighting the two streams according to depth reliability estimates. The representation is trained end-to-end with a transformer-based architecture that captures spatial relationships between hand and object features. Experiments demonstrate that DexRep-Vis achieves an 87.3% grasp success rate on transparent objects, outperforming baselines.
Key findings
DexRep-Vis extends geometric hand-object representations with visual texture features.
Fuses geometric point cloud features with RGB-based texture and material features.
Adaptively weights geometric and visual features based on depth reliability estimates.
Trained end-to-end using a transformer-based architecture capturing spatial relationships.
Achieves 87.3% grasp success rate on transparent objects, outperforming baselines.
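The adaptive weighting in the findings above can be illustrated with a minimal reliability-gated fusion sketch. The function name, feature shapes, and the linear blending rule are illustrative assumptions for exposition, not the paper's actual architecture (which uses a learned transformer-based fusion):

```python
import numpy as np

def fuse_features(geom_feat, vis_feat, depth_reliability):
    """Blend geometric and visual features per point (hypothetical sketch).

    geom_feat, vis_feat: (N, D) per-point feature arrays.
    depth_reliability: (N,) scores in [0, 1]; low values mark points
    where depth is untrustworthy (e.g. transparent or reflective
    surfaces), shifting weight toward the visual branch.
    """
    w = depth_reliability[:, None]          # (N, 1) gate, broadcast over D
    return w * geom_feat + (1.0 - w) * vis_feat

# Toy usage: at reliability 0.0 the fused feature comes entirely from
# the visual branch; at 1.0, entirely from the geometric branch.
geom = np.ones((4, 8))                      # stand-in geometric features
vis = np.zeros((4, 8))                      # stand-in visual features
rel = np.array([1.0, 0.5, 0.1, 0.0])        # per-point depth reliability
fused = fuse_features(geom, vis, rel)
```

In the actual method the gate would be predicted by a network rather than supplied directly, but the sketch captures the core idea: degrading depth reliability smoothly hands control over to RGB-derived features.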
Limitations & open questions
DexRep-Vis's performance may degrade under extreme lighting conditions, since the visual branch depends on stable texture and material cues.
The method's reliance on RGB images may limit its applicability in low-light scenarios where texture features are weak or absent.