This paper introduces a self-supervised video foundation model for multi-class endoscopic lesion detection and temporal tracking. The model leverages Video Masked Autoencoders (VideoMAE) with anatomical guidance to learn spatio-temporal representations from unlabeled colonoscopy videos, outperforming CNN baselines and zero-shot vision-language models.
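The paper's exact masking scheme is not reproduced here, but the standard VideoMAE pre-training idea can be illustrated with "tube masking": the same spatial patches are hidden in every frame of a clip, so the model cannot reconstruct them by copying a neighboring frame. A minimal sketch, assuming a clip of `num_frames` frames each tokenized into `patches_per_frame` patches; the function name and parameters are illustrative, not from the paper:

```python
import random

def tube_mask(num_frames: int, patches_per_frame: int,
              mask_ratio: float, seed: int = 0):
    """VideoMAE-style tube mask: mask the SAME spatial patch indices
    in every frame, so masked content cannot be trivially recovered
    from adjacent frames. Returns per-frame boolean lists (True = masked).
    """
    rng = random.Random(seed)
    num_masked = int(patches_per_frame * mask_ratio)
    masked_idx = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [i in masked_idx for i in range(patches_per_frame)]
    # Identical spatial mask repeated along the temporal axis (the "tube").
    return [list(frame_mask) for _ in range(num_frames)]

# Typical VideoMAE settings: 16-frame clip, 14x14 patch grid, 90% masking.
mask = tube_mask(num_frames=16, patches_per_frame=196, mask_ratio=0.9)
```

The very high mask ratio (around 90% in VideoMAE) is what makes pre-training cheap while still forcing the encoder to learn temporal structure, which fits the summary's emphasis on temporal consistency.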
Key findings
Proposes a self-supervised video foundation model pre-trained on unlabeled colonoscopy videos.
Employs VideoMAE with anatomical guidance to capture spatio-temporal features.
Achieves 76.54% mAP@0.5, outperforming CNN and zero-shot detection baselines.
Demonstrates the importance of temporal consistency and anatomical prior integration.
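For readers unfamiliar with the reported metric, mAP@0.5 is the mean over lesion classes of average precision computed with an IoU match threshold of 0.5. The paper's exact evaluation protocol is not reproduced here; the following is a minimal single-class sketch using greedy matching in descending score order (helper names are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thresh=0.5):
    """AP at a fixed IoU threshold for one class.

    preds: list of (score, box); gts: list of boxes. Each ground-truth
    box may be matched at most once, in descending score order.
    """
    preds = sorted(preds, key=lambda p: -p[0])
    matched = [False] * len(gts)
    tp, fp, precisions, recalls = 0, 0, [], []
    for score, box in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if not matched[j]:
                o = iou(box, g)
                if o > best:
                    best, best_j = o, j
        if best >= iou_thresh:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / len(gts))
    # All-point integration: accumulate precision over recall increments.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP@0.5 is then the unweighted mean of `average_precision` over all lesion classes; the 76.54% figure reported above corresponds to this quantity under the paper's multi-class setup.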
Limitations & open questions
Limited evaluation on diverse lesion types and rare categories.
Real-world deployment performance and clinical impact remain to be validated.