GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
CoRR (2024)
Abstract
Given unlabelled datasets containing both old and new categories, generalized
category discovery (GCD) aims to accurately discover new classes while
correctly classifying old ones, leveraging the class concepts learned from
labelled samples. Current GCD methods use only a single modality, visual
information, and therefore classify visually similar classes poorly. Although
such classes can be visually confusable, their text descriptions may still be
distinct, which motivates us to introduce text information into the GCD task.
However, the lack of class names for unlabelled data makes it impractical to
utilize text information directly. To tackle this challenging problem, we
propose a Text Embedding Synthesizer (TES) that generates pseudo text
embeddings for unlabelled samples. Specifically, TES leverages the property
that CLIP produces aligned vision-language features: it converts visual
embeddings into tokens for CLIP's text encoder, which then generates the
pseudo text embeddings (a minimal sketch follows the abstract). In addition,
we employ a dual-branch framework in which joint learning and instance-level
consistency between the two modality branches let visual and semantic
information mutually enhance each other, promoting the interaction and fusion
of the visual and text embedding spaces (see the second sketch below). Our
method unlocks the multi-modal potential of CLIP and outperforms baseline
methods by a large margin on all GCD benchmarks, achieving a new state of the
art. The code will be released at .
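
The abstract only names the TES mechanism, so the following is a minimal PyTorch sketch of one plausible reading: a learned projection turns a CLIP image embedding into a short sequence of pseudo token embeddings intended for CLIP's frozen text encoder, trained so that the resulting pseudo text embedding aligns with its source visual embedding. The class name, dimensions, token count, stub "text encoder", and the InfoNCE-style objective are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbeddingSynthesizer(nn.Module):
    """Maps a CLIP image embedding to a sequence of pseudo token
    embeddings that a frozen CLIP text encoder could consume.
    Dimensions and token count are illustrative assumptions."""

    def __init__(self, visual_dim=512, token_dim=512, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.token_dim = token_dim
        # One linear map from the image embedding to all pseudo tokens.
        self.to_tokens = nn.Linear(visual_dim, num_tokens * token_dim)

    def forward(self, image_emb):
        # image_emb: (B, visual_dim) -> pseudo tokens: (B, num_tokens, token_dim)
        tokens = self.to_tokens(image_emb)
        return tokens.view(-1, self.num_tokens, self.token_dim)


def alignment_loss(pseudo_text_emb, image_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE pulling each pseudo text embedding
    toward the image embedding it was synthesized from (assumed objective)."""
    t = F.normalize(pseudo_text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Toy usage with a stand-in for the frozen text encoder (mean pooling);
# in practice the pseudo tokens would be fed to CLIP's text transformer.
tes = TextEmbeddingSynthesizer()
image_emb = torch.randn(4, 512)               # batch of CLIP image embeddings
pseudo_tokens = tes(image_emb)                # (4, 8, 512)
pseudo_text_emb = pseudo_tokens.mean(dim=1)   # stub "text encoder"
loss = alignment_loss(pseudo_text_emb, image_emb)
```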
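Likewise, the abstract does not specify the form of the instance consistency used in the dual-branch framework. A common way to enforce agreement between two branches on the same instances is a symmetric KL term over their predicted class distributions; the sketch below makes only that assumption and is not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def instance_consistency_loss(logits_visual, logits_text):
    """Symmetric KL divergence between the class distributions the
    visual and text branches predict for the same instances (an
    assumed form of instance consistency)."""
    log_p_v = F.log_softmax(logits_visual, dim=-1)
    log_p_t = F.log_softmax(logits_text, dim=-1)
    kl_vt = F.kl_div(log_p_v, log_p_t.exp(), reduction="batchmean")
    kl_tv = F.kl_div(log_p_t, log_p_v.exp(), reduction="batchmean")
    return 0.5 * (kl_vt + kl_tv)

# Toy usage: two branches scoring the same batch over (old + new) classes.
logits_v = torch.randn(4, 10)  # visual-branch logits
logits_t = torch.randn(4, 10)  # text-branch logits
loss = instance_consistency_loss(logits_v, logits_t)
```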