Task-like training paradigm in CLIP for zero-shot sketch-based image retrieval

Multimedia Tools and Applications (2023)

Abstract
The Contrastive Language-Image Pre-training (CLIP) model has recently gained attention in the zero-shot domain. However, it still falls short in addressing cross-modal perception and the semantic gap between seen and unseen classes in Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR). To overcome these obstacles, we propose a Task-Like Training paradigm (TLT). In this work, we view cross-modal perception and the semantic gap as a multi-task learning process. Before tackling these challenges, we fully exploit CLIP's text encoder and propose a text-based identification learning mechanism that helps the model quickly learn discriminative features. Next, we propose text prompt tutoring and cross-modal consistency learning to address cross-modal perception and the semantic gap, respectively. Meanwhile, we present a collaborative architecture to explore the shared information between tasks. Extensive experiments show that our approach significantly outperforms state-of-the-art methods on the Sketchy, Sketchy-No, TU-Berlin, and QuickDraw datasets.
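The abstract does not give implementation details, so the following is only a minimal sketch of how a multi-task objective of this kind could look on top of CLIP. It assumes a standard CLIP zero-shot classification head for the text-based identification task and a symmetric InfoNCE loss for cross-modal (sketch-photo) consistency; the prompt template, function names, and loss weights (w_id, w_cm) are illustrative assumptions, not the paper's actual method.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def text_identification_loss(features, labels, class_text_features, temperature=0.07):
    # Classify visual features against frozen class-name text embeddings
    # (assumed reading of "text-based identification learning").
    logits = features @ class_text_features.t() / temperature
    return F.cross_entropy(logits, labels)

def cross_modal_consistency_loss(sketch_feats, photo_feats, temperature=0.07):
    # Symmetric InfoNCE pulling matched sketch/photo pairs together
    # (assumed reading of "cross-modal consistency learning").
    logits = sketch_feats @ photo_feats.t() / temperature
    targets = torch.arange(len(sketch_feats), device=sketch_feats.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def training_step(sketches, photos, labels, class_names, w_id=1.0, w_cm=1.0):
    # sketches/photos: preprocessed image batches already moved to `device`;
    # labels: class indices into `class_names`. Loss weights are assumptions.
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feats = F.normalize(model.encode_text(tokens).float(), dim=-1)
    sketch_feats = F.normalize(model.encode_image(sketches).float(), dim=-1)
    photo_feats = F.normalize(model.encode_image(photos).float(), dim=-1)
    loss = w_id * (text_identification_loss(sketch_feats, labels, text_feats)
                   + text_identification_loss(photo_feats, labels, text_feats))
    loss = loss + w_cm * cross_modal_consistency_loss(sketch_feats, photo_feats)
    return loss

Treating the two losses as separate tasks sharing one encoder mirrors the multi-task framing described above; the paper's text prompt tutoring and collaborative architecture are not reproduced here, as the abstract gives no detail about them.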
Keywords
Zero-shot sketch-based image retrieval, Text prompt tutoring, Text-based identification learning, Cross-modal consistency learning, Collaborative architecture