This research addresses human pose estimation under occlusion and varying lighting by introducing MM-MoE-Pose, a multi-modal fusion framework that routes visual and inertial features through expert networks according to the estimated reliability of each input, so that pose estimates remain robust when one modality degrades.
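The summary does not spell out how the reliability-based routing is computed, so the following is only a minimal sketch of one plausible formulation: a sparse top-k gate whose logits are conditioned on per-modality reliability scores. The class name ReliabilityGatedRouter, the two-score reliability vector, and all layer sizes are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReliabilityGatedRouter(nn.Module):
    """Sparse top-k gate conditioned on per-modality reliability scores.
    All names, shapes, and the conditioning scheme are assumptions made
    for illustration; the summary does not specify the actual gate."""

    def __init__(self, feat_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The gate sees the fused feature plus one reliability score per
        # modality (visual, inertial), so a degraded sensor can shift
        # routing toward experts that lean on the other modality.
        self.gate = nn.Linear(feat_dim + 2, num_experts)

    def forward(self, x: torch.Tensor, reliability: torch.Tensor):
        # x: (batch, feat_dim); reliability: (batch, 2), scores in [0, 1]
        logits = self.gate(torch.cat([x, reliability], dim=-1))
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # mixture weights, (batch, top_k)
        return topk_idx, weights

router = ReliabilityGatedRouter(feat_dim=256, num_experts=8)
x = torch.randn(4, 256)
reliability = torch.tensor([[0.9, 0.2]] * 4)  # clear view, noisy IMU
expert_idx, expert_w = router(x, reliability)
print(expert_idx.shape, expert_w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Under this kind of conditioning, a sample with a low inertial reliability score can be routed toward experts that weight visual evidence more heavily, matching the behavior the summary describes.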
Key findings
MM-MoE-Pose dynamically selects expert networks based on sensor reliability for robust pose estimation.
The framework comprises four stages: modality-specific encoders, a sparse MoE fusion layer, a cross-modal calibration module, and a kinematic decoder (see the sketch after this list).
It achieves state-of-the-art results in challenging scenarios while maintaining real-time inference.
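To make the four-stage layout concrete, here is a minimal PyTorch sketch wiring the named components together. Everything beyond the component names is an assumption for illustration: the layer sizes, input feature dimensions, the residual form of the calibration module, the 17-joint output, and the plain linear gate are not taken from the paper.

```python
import torch
import torch.nn as nn

class MMPosePipeline(nn.Module):
    """Toy end-to-end layout: modality-specific encoders -> sparse MoE
    fusion -> cross-modal calibration -> kinematic decoder. Sizes and
    the internals of each stage are illustrative assumptions."""

    def __init__(self, d=256, num_experts=8, top_k=2, num_joints=17):
        super().__init__()
        self.top_k, self.num_joints = top_k, num_joints
        # Modality-specific encoders (stand-ins for the real backbones).
        self.visual_enc = nn.Sequential(nn.Linear(512, d), nn.ReLU())
        self.inertial_enc = nn.Sequential(nn.Linear(60, d), nn.ReLU())
        # Sparse MoE fusion: a gate plus a pool of expert MLPs. A
        # reliability-conditioned gate (see the earlier sketch) would
        # replace this plain linear gate.
        self.gate = nn.Linear(2 * d, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * d, d), nn.ReLU()) for _ in range(num_experts)
        )
        # Cross-modal calibration, sketched here as a residual MLP.
        self.calib = nn.Sequential(nn.Linear(d, d), nn.Tanh())
        # Kinematic decoder regressing 3D joint coordinates.
        self.decoder = nn.Linear(d, num_joints * 3)

    def forward(self, rgb_feat, imu_feat):
        fused = torch.cat([self.visual_enc(rgb_feat), self.inertial_enc(imu_feat)], -1)
        gate_vals, gate_idx = self.gate(fused).topk(self.top_k, dim=-1)
        weights = gate_vals.softmax(dim=-1)
        # Evaluate only the selected experts for each sample (sparse routing).
        mixed = torch.zeros(fused.size(0), self.decoder.in_features)
        for b in range(fused.size(0)):
            for k in range(self.top_k):
                expert = self.experts[int(gate_idx[b, k])]
                mixed[b] += weights[b, k] * expert(fused[b:b + 1]).squeeze(0)
        pose = self.decoder(mixed + self.calib(mixed))  # residual calibration
        return pose.view(-1, self.num_joints, 3)

pipeline = MMPosePipeline()
pose = pipeline(torch.randn(4, 512), torch.randn(4, 60))
print(pose.shape)  # torch.Size([4, 17, 3])
```

Because only top_k of the num_experts expert MLPs run per sample, per-frame compute stays close to that of a single dense fusion layer, which would be consistent with the real-time claim above.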
Limitations & open questions
The paper does not discuss the computational overhead of the proposed framework.
Further research is needed to scale the framework to broader applications.