Bridging the gap: dual perception attention and local-global similarity fusion for cross-modal image-text matching

Xiangyu Shui, Zhenfang Zhu, Yun Liu, Hongli Pei, Kefeng Li, Huaxiang Zhang

Multimedia Tools and Applications (2024)

Abstract
Current image-text matching methods implicitly align visual-semantic segments within images and employ cross-modal attention mechanisms to discover fine-grained cross-modal semantic correspondences. Although region-word pairs constitute local matches across modalities, they may yield inaccurate relevance measurements when viewed from the global perspective of the image-text relationship. Additionally, cross-modal attention mechanisms may introduce redundant or irrelevant region-word alignments, which reduce retrieval accuracy and limit efficiency. To address these challenges, we propose a Dual perception Attention and local-global Similarity Fusion framework (DASF). Specifically, we combine two types of similarity matching, global and local, to establish a more accurate correspondence between images and text by simultaneously considering global semantics and local details during matching. In parallel, we integrate a dual-perception attention mechanism that learns the relationship between images and text, using attention polarity to determine the degree of matching and to better account for contextual and semantic information, thereby reducing interference from irrelevant regions. Extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the effectiveness of our DASF, which achieves state-of-the-art performance.
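The abstract names two mechanisms: a polarity-aware cross-modal attention step that suppresses irrelevant region-word alignments, and a fusion of local (region-word) with global (whole-image/whole-sentence) similarity. The following is a minimal PyTorch sketch of those two ideas only, not the authors' implementation: the function name, the zero-threshold gate standing in for "attention polarity", and the fusion weight `alpha` are all illustrative assumptions.

```python
# Illustrative sketch (assumptions, not the paper's code): region features of
# shape (n_regions, d) and word features of shape (n_words, d) are assumed to
# come from standard image/text encoders.
import torch
import torch.nn.functional as F

def local_global_similarity(regions, words, alpha=0.5):
    """Fuse a local region-word similarity with a global pooled similarity."""
    # L2-normalize so dot products are cosine similarities in [-1, 1].
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # --- Local branch: word-to-region cross-modal attention ---
    sim = words @ regions.t()                     # (n_words, n_regions)
    # Polarity-style gate (assumption): negatively correlated region-word
    # pairs are treated as irrelevant and dropped before normalization.
    w = sim.clamp(min=0)
    attn = w / (w.sum(dim=-1, keepdim=True) + 1e-8)
    attended = attn @ regions                     # attended visual context per word
    local_sim = F.cosine_similarity(words, attended, dim=-1).mean()

    # --- Global branch: pooled image vs. pooled sentence ---
    img_global = F.normalize(regions.mean(dim=0), dim=-1)
    txt_global = F.normalize(words.mean(dim=0), dim=-1)
    global_sim = img_global @ txt_global

    # Fuse local detail with global semantics into one matching score.
    return alpha * local_sim + (1 - alpha) * global_sim

# Toy usage: 36 detected regions and a 12-word caption with 1024-d features.
score = local_global_similarity(torch.randn(36, 1024), torch.randn(12, 1024))
print(score.item())
```

The zero-threshold gate is one simple way to realize the "reducing interference from irrelevant regions" claim; the paper's actual dual-perception attention and fusion weighting may differ.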
Keywords
Image-text Matching, Dual perception attention, Cross-modal