Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching.

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society(2024)

引用 2|浏览84
暂无评分
摘要
The mainstream of image and sentence matching studies currently focuses on fine-grained alignment of image regions and sentence words. However, these methods miss a crucial fact: the correspondence between images and sentences does not simply come from alignments between individual regions and words but from alignments between the phrases they form respectively. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching by modeling the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled manner for training and inferencing, which is able to release the trade-off for bi-directional retrieval, where image-to-sentence matching is executed in textual semantic space and sentence-to-image matching is executed in visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin, and can compete with some methods introducing external knowledge.
更多
查看译文
关键词
matching,cross-modal,phrase-attention,image-sentence
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要