This paper introduces a self-supervised video foundation model for multi-class endoscopic lesion detection and temporal tracking. The model leverages Video Masked Autoencoders (VideoMAE) with anatomical guidance to learn spatio-temporal representations from unlabeled colonoscopy videos, outperforming CNN baselines and zero-shot vision-language models.
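The paper's exact masking scheme is not reproduced here, but the standard VideoMAE pre-training idea can be illustrated with "tube masking": the same spatial patches are hidden in every frame of a clip, so the model cannot reconstruct them by copying a neighboring frame. A minimal sketch, assuming a clip of `num_frames` frames each tokenized into `patches_per_frame` patches; the function name and parameters are illustrative, not from the paper:

```python
import random

def tube_mask(num_frames: int, patches_per_frame: int,
              mask_ratio: float, seed: int = 0):
    """VideoMAE-style tube mask: mask the SAME spatial patch indices
    in every frame, so masked content cannot be trivially recovered
    from adjacent frames. Returns per-frame boolean lists (True = masked).
    """
    rng = random.Random(seed)
    num_masked = int(patches_per_frame * mask_ratio)
    masked_idx = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [i in masked_idx for i in range(patches_per_frame)]
    # Identical spatial mask repeated along the temporal axis (the "tube").
    return [list(frame_mask) for _ in range(num_frames)]

# Typical VideoMAE settings: 16-frame clip, 14x14 patch grid, 90% masking.
mask = tube_mask(num_frames=16, patches_per_frame=196, mask_ratio=0.9)
```

The very high mask ratio (around 90% in VideoMAE) is what makes pre-training cheap while still forcing the encoder to learn temporal structure, which fits the summary's emphasis on temporal consistency.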
Key findings
Proposes a self-supervised video foundation model pre-trained on unlabeled colonoscopy videos.
Employs VideoMAE with anatomical guidance to capture spatio-temporal features.
Achieves 76.54% mAP@0.5, outperforming CNN and zero-shot detection baselines.
Demonstrates the importance of temporal consistency and anatomical prior integration.
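For readers unfamiliar with the reported metric, mAP@0.5 is the mean over lesion classes of average precision computed with an IoU match threshold of 0.5. The paper's exact evaluation protocol is not reproduced here; the following is a minimal single-class sketch using greedy matching in descending score order (helper names are illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thresh=0.5):
    """AP at a fixed IoU threshold for one class.

    preds: list of (score, box); gts: list of boxes. Each ground-truth
    box may be matched at most once, in descending score order.
    """
    preds = sorted(preds, key=lambda p: -p[0])
    matched = [False] * len(gts)
    tp, fp, precisions, recalls = 0, 0, [], []
    for score, box in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if not matched[j]:
                o = iou(box, g)
                if o > best:
                    best, best_j = o, j
        if best >= iou_thresh:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / len(gts))
    # All-point integration: accumulate precision over recall increments.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP@0.5 is then the unweighted mean of `average_precision` over all lesion classes; the 76.54% figure reported above corresponds to this quantity under the paper's multi-class setup.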
Limitations & open questions
Limited evaluation on diverse lesion types and rare categories.
Real-world deployment performance and clinical impact remain to be validated.