This paper proposes a theoretical framework for analyzing intermediate-layer selection in CLIP models, introducing metrics that quantify embedding quality and deriving principles for choosing layers according to task-specific requirements.
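As a concrete illustration of the setting the framework analyzes, the sketch below extracts one embedding per transformer layer from a CLIP vision encoder. It is a minimal sketch assuming the Hugging Face transformers CLIP implementation; the model checkpoint and CLS-token pooling are illustrative choices, not the paper's setup.

```python
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

# Load only the vision tower; output_hidden_states exposes every layer.
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def layer_embeddings(images):
    """Return a list of pooled embeddings, one per transformer layer."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states = (patch embeddings, layer_1, ..., layer_L);
    # skip index 0 and pool each layer via its CLS token.
    return [h[:, 0, :] for h in out.hidden_states[1:]]
```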
Key findings
Contrastive Language-Image Pre-training (CLIP) models produce embeddings whose quality varies across layers.
Embedding quality follows a non-monotonic progression through the network.
The optimal layer depends on whether the task favors feature specificity or generality.
Theoretical analysis provides a foundation for principled layer selection to improve downstream task performance (see the probing sketch after this list).
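The paper's own quality metrics are not reproduced here; as a stand-in, a per-layer linear probe can expose the non-monotonic quality curve and support a simple form of layer selection. The function name and the scikit-learn probe are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_layer(layer_feats, labels, cv=5):
    """Score each layer with a linear probe; return the best layer index.

    layer_feats: list of (n_samples, dim) arrays, one per layer.
    labels:      (n_samples,) task labels.
    Stand-in metric, not the paper's: probe accuracy per layer is
    typically non-monotonic in depth, so a full per-layer sweep is needed.
    """
    scores = [
        cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=cv).mean()
        for X in layer_feats
    ]
    return int(np.argmax(scores)), scores
```

In practice the winning layer shifts with the task: fine-grained tasks tend to prefer different depths than coarse semantic ones, which is the specificity-versus-generality trade-off noted above.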
Limitations & open questions
The study focuses on CLIP models and may not generalize to all vision-language architectures.
The proposed layer selection strategies require further experimental validation across a broader range of tasks.