no code implementations • 17 Apr 2024 • Wenbo Zhang, Yifan Zhang, Jianfeng Lin, Binqiang Huang, Jinlu Zhang, Wenhao Yu
Pre-trained vision-language (V-L) models such as CLIP have shown excellent performance in many downstream cross-modal tasks.
Image Classification • Knowledge Distillation