Monocular 3D keypoint lifting from 2D observations is a fundamental challenge in computer vision with applications in human pose estimation, robotics, and augmented reality. Current approaches either entangle depth and 2D pose features or rely on domain-specific architectures. We propose DisenMoE, a novel architecture that combines disentangled representation learning with a Mixture-of-Experts routing mechanism to achieve general-purpose 3D keypoint lifting.
Key findings
DisenMoE separates 2D pose features from depth estimation through specialized expert modules.
A learnable router dynamically assigns input keypoints to the most suitable experts based on skeletal topology and joint characteristics.
The design enables cross-domain generalization, efficient computation, and modular scalability.
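The routing idea in the findings above can be illustrated with a small sketch in plain Python. This is not the authors' implementation. The class name `ToyDisentangledMoE`, the one-hot joint index standing in for skeletal topology, the single router shared by both expert groups, top-k softmax gating, the random untrained weights, and the treatment of the 2D branch as a residual refinement are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(rng, n_out, n_in):
    """Random (weights, bias) pair; weights is a list of n_out rows."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Compute y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToyDisentangledMoE:
    """Hypothetical toy model: separate pose and depth experts behind one router."""

    def __init__(self, num_joints, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.num_joints = num_joints
        self.top_k = top_k
        feat = 2 + num_joints  # (x, y) plus a one-hot joint index (assumed encoding)
        self.router = make_linear(rng, num_experts, feat)
        # Disentanglement (as assumed here): 2D pose and depth use separate experts.
        self.pose_experts = [make_linear(rng, 2, feat) for _ in range(num_experts)]
        self.depth_experts = [make_linear(rng, 1, feat) for _ in range(num_experts)]

    def features(self, joint_idx, xy):
        one_hot = [1.0 if j == joint_idx else 0.0 for j in range(self.num_joints)]
        return [xy[0], xy[1]] + one_hot

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts, weights summing to 1."""
        gates = softmax(linear(*self.router, x))
        top = sorted(range(len(gates)), key=lambda e: gates[e], reverse=True)[: self.top_k]
        norm = sum(gates[e] for e in top)
        return {e: gates[e] / norm for e in top}

    def lift(self, keypoints_2d):
        """Map a list of (x, y) keypoints to a list of (x, y, z) tuples."""
        lifted = []
        for j, xy in enumerate(keypoints_2d):
            x = self.features(j, xy)
            dx = dy = z = 0.0
            for e, w in self.route(x).items():
                px, py = linear(*self.pose_experts[e], x)   # 2D pose branch
                (pz,) = linear(*self.depth_experts[e], x)   # depth branch
                dx += w * px
                dy += w * py
                z += w * pz
            lifted.append((xy[0] + dx, xy[1] + dy, z))
        return lifted


model = ToyDisentangledMoE(num_joints=3)
print(model.lift([(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]))
```

Because only the top-k experts run for each keypoint, the cost per keypoint stays fixed as more experts are added, which is one way to read the efficiency and scalability claims. The weights here are random, so the printed coordinates carry no meaning beyond showing the data flow.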
Limitations & open questions
Risks include expert collapse (the router sending most inputs to only a few experts), routing instability, and domain gaps that persist across target domains.
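Expert collapse can at least be monitored from routing statistics. The sketch below is one possible diagnostic, not something described in the source: it assumes top-1 expert assignments are logged as integer indices, and the helper names `expert_utilization` and `normalized_entropy` are hypothetical. A normalized entropy close to 1 indicates balanced expert usage, and a value close to 0 indicates that routing has collapsed onto a single expert.

```python
import math
from collections import Counter


def expert_utilization(assignments, num_experts):
    """Fraction of inputs routed to each expert (assignments are expert indices)."""
    if not assignments:
        raise ValueError("assignments must not be empty")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def normalized_entropy(fractions):
    """Shannon entropy of the usage distribution, scaled to the range [0, 1]."""
    n = len(fractions)
    if n < 2:
        return 0.0
    h = sum(-p * math.log(p) for p in fractions if p > 0)
    return h / math.log(n)


# Made-up assignment logs for illustration only.
balanced = expert_utilization([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
collapsed = expert_utilization([2, 2, 2, 2, 2, 2, 2, 2], num_experts=4)
print(normalized_entropy(balanced), normalized_entropy(collapsed))
```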