Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification
CoRR(2024)
摘要
Identifying labels that did not appear during training, known as multi-label
zero-shot learning, is a non-trivial task in computer vision. To this end,
recent studies have attempted to explore the multi-modal knowledge of
vision-language pre-training (VLP) models by knowledge distillation, allowing
to recognize unseen labels in an open-vocabulary manner. However, experimental
evidence shows that knowledge distillation is suboptimal and provides limited
performance gain in unseen label prediction. In this paper, a novel query-based
knowledge sharing paradigm is proposed to explore the multi-modal knowledge
from the pretrained VLP model for open-vocabulary multi-label classification.
Specifically, a set of learnable label-agnostic query tokens is trained to
extract critical vision knowledge from the input image, and further shared
across all labels, allowing them to select tokens of interest as visual clues
for recognition. Besides, we propose an effective prompt pool for robust label
embedding, and reformulate the standard ranking learning into a form of
classification to allow the magnitude of feature vectors for matching, which
both significantly benefit label recognition. Experimental results show that
our framework significantly outperforms state-of-the-art methods on zero-shot
task by 5.9
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要