Distilling Text-Image Foundation Models

ICLR 2023

Abstract
Large pretrained foundation models (such as CLIP and DALL-E) are among the most significant recent advances in the AI community, and their implications are profound. This paper examines the value of these foundation models as a model knowledge base: we aim to distill the knowledge in these foundation models to train lightweight models designed for specific tasks in practical application scenarios, with improved performance. Despite abundant progress in knowledge distillation (KD) for traditional models trained under the supervision of class labels encoded as integers, distilling such text-image contrastive learning models has not been explored extensively. Meanwhile, KD is known to suffer from the capacity gap problem: distilling knowledge from a teacher significantly larger than the student often degrades the student's performance. The teacher-student capacity gap when distilling foundation models is even larger, so how to overcome this potential issue also remains unclear. This paper presents detailed analyses of these questions, aiming to successfully tap into a pretrained foundation model (CLIP) to boost the student's performance. Besides the practical performance benefits, several interesting discoveries are unveiled: (1) CLIP is not bothered by the capacity gap, which may prompt us to re-evaluate whether the "capacity gap" issue is really due to the capacity gap; (2) we find the reason is largely that CLIP is not over-confident in the wrong labels when it misclassifies input images.
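To make the setting concrete, below is a minimal sketch (not the paper's actual method) of distilling CLIP's zero-shot predictions into a small task-specific student. It assumes PyTorch and a standard Hinton-style KD objective, with the teacher distribution formed from CLIP image-text cosine similarities; the logit scale, temperature, loss weight, and the stand-in student/feature tensors are illustrative assumptions, not values from the paper.

```python
# Sketch: KD from CLIP zero-shot logits to a lightweight student (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_teacher_logits(clip_image_features, clip_text_features, logit_scale=100.0):
    """Teacher logits: scaled cosine similarity between image and class-prompt embeddings."""
    img = F.normalize(clip_image_features, dim=-1)
    txt = F.normalize(clip_text_features, dim=-1)
    return logit_scale * img @ txt.t()            # [batch, num_classes]

def distillation_loss(student_logits, teacher_logits, labels, tau=4.0, alpha=0.5):
    """Blend cross-entropy on hard labels with KL to the softened CLIP teacher distribution."""
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

if __name__ == "__main__":
    # Random tensors stand in for real images, CLIP encoder outputs, and a student network.
    batch, num_classes, dim = 8, 10, 512
    student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))
    images = torch.randn(batch, 3, 32, 32)
    labels = torch.randint(0, num_classes, (batch,))
    img_feat = torch.randn(batch, dim)        # stand-in for CLIP image encoder output
    txt_feat = torch.randn(num_classes, dim)  # stand-in for CLIP text encoder (class prompts)
    loss = distillation_loss(student(images),
                             clip_teacher_logits(img_feat, txt_feat),
                             labels)
    loss.backward()
```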
Keywords
foundation models,CLIP,knowledge distillation,capacity gap