Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation
CoRR (2024)
Abstract
Despite the significant progress in deep learning for dense visual
recognition problems, such as semantic segmentation, traditional methods are
constrained by fixed class sets. Meanwhile, vision-language foundation models,
such as CLIP, have showcased remarkable effectiveness in numerous zero-shot
image-level tasks, owing to their robust generalizability. Recently, a body of
work has investigated utilizing these models in open-vocabulary semantic
segmentation (OVSS). However, existing approaches often rely on impractical
supervised pre-training or access to additional pre-trained networks. In this
work, we propose a strong baseline for training-free OVSS, termed
Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of
CLIP tailored for this scenario. Our method enforces localization of patches in
the self-attention of CLIP's vision transformer which, despite being crucial
for dense prediction tasks, has been overlooked in the OVSS literature. By
incorporating design choices favouring segmentation, our approach significantly
improves performance without requiring additional data, auxiliary pre-trained
networks, or extensive hyperparameter tuning, making it highly practical for
real-world applications. Experiments are performed on 8 popular semantic
segmentation benchmarks, yielding state-of-the-art performance in most
scenarios. Our code is publicly available at https://github.com/sinahmr/NACLIP.
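To make the mechanism concrete, below is a minimal sketch of the kind of neighbour-aware attention the abstract describes: patch-to-patch attention augmented with a Gaussian spatial prior, so that each patch attends preferentially to its spatial neighbours. The function name, the key-key similarity (a choice seen in training-free adaptations of CLIP for dense prediction), the scale factor, and the standard deviation `std` are illustrative assumptions, not the authors' exact implementation; see the linked repository for the real code.

```python
import torch
import torch.nn.functional as F

def neighbour_aware_attention(k: torch.Tensor, grid_size: tuple[int, int],
                              std: float = 5.0) -> torch.Tensor:
    """Sketch of neighbourhood-biased self-attention over ViT patch tokens.

    k:         (N, d) key vectors for the N = H*W patch tokens (CLS excluded).
    grid_size: (H, W) layout of the patch grid.
    std:       width of the Gaussian spatial prior (illustrative value).
    Returns an (N, N) attention matrix in which each patch attends to
    similar patches, biased towards its spatial neighbours.
    """
    H, W = grid_size
    N, d = k.shape
    assert N == H * W, "token count must match the patch grid"

    # Key-key similarity in place of the usual query-key product; keys of
    # patches with similar content tend to cluster, which suits dense tasks.
    sim = (k @ k.t()) / d ** 0.5  # (N, N)

    # Gaussian spatial prior: a bias that decays with the squared grid
    # distance between patch i and patch j, enforcing localization.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (N, 2)
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)   # (N, N)
    omega = -dist2 / (2 * std ** 2)

    # Adding omega before the softmax multiplies each attention weight by a
    # Gaussian of the spatial distance between the two patches.
    return F.softmax(sim + omega, dim=-1)
```

Plugged into the final block of CLIP's image encoder, with each resulting patch embedding classified against CLIP text embeddings of the class names, this is roughly the training-free segmentation setup the abstract refers to.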