OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation
CoRR (2024)
Abstract
The recent success of CLIP has demonstrated promising results in zero-shot
semantic segmentation by transferring multimodal knowledge to pixel-level
classification. However, leveraging pre-trained CLIP knowledge to closely align
text embeddings with pixel embeddings still has limitations in existing
approaches. To address this issue, we propose OTSeg, a novel multimodal
attention mechanism aimed at enhancing the potential of multiple text prompts
for matching associated pixel embeddings. We first propose Multi-Prompts
Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads
multiple text prompts to selectively focus on various semantic features within
image pixels. Moreover, inspired by the success of Sinkformers in unimodal
settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn
Attention (MPSA), which effectively replaces cross-attention mechanisms within
the Transformer framework in multimodal settings. Through extensive experiments, we
demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with
significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three
benchmark datasets.
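The abstract does not give the MPS formulation itself, but the Sinkhorn algorithm it builds on is standard. Below is a minimal, illustrative sketch of Sinkhorn-Knopp normalization applied to a prompt-to-pixel similarity matrix, so that each text prompt's transport mass spreads over pixels under uniform marginals. The function name, hyperparameters (`eps`, `n_iters`), and uniform-marginal choice are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

def sinkhorn(sim, n_iters=200, eps=0.1):
    """Sinkhorn-Knopp scaling of a similarity matrix (illustrative sketch).

    sim: (P, N) similarities between P text prompts and N pixel embeddings.
    Returns a transport plan T whose rows sum to 1/P and columns to 1/N
    (uniform marginals assumed here for simplicity).
    """
    P, N = sim.shape
    K = np.exp(sim / eps)                    # Gibbs kernel from similarities
    r = np.ones(P) / P                       # target row marginal (uniform)
    c = np.ones(N) / N                       # target column marginal (uniform)
    u, v = np.ones(P), np.ones(N)            # scaling vectors
    for _ in range(n_iters):
        u = r / (K @ v)                      # match row marginals
        v = c / (K.T @ u)                    # match column marginals
    return u[:, None] * K * v[None, :]       # T = diag(u) K diag(v)

# Toy usage: 4 prompts matched against 16 pixel embeddings.
rng = np.random.default_rng(0)
T = sinkhorn(rng.normal(size=(4, 16)))
```

In an attention-like use, each row of `T` can be read as a distribution over pixels for one prompt, which is what lets multiple prompts focus on different semantic regions rather than all collapsing onto the same pixels.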