Less is Better: Exponential Loss for Cross-Modal Matching

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Deep metric learning has become a key component of cross-modal retrieval. By learning to pull the features of matched instances closer while pushing the features of mismatched instances farther apart, one can learn highly robust multi-modal representations. Most existing cross-modal retrieval methods use the vanilla triplet loss to train the network, which cannot adaptively penalize pairs of different hardness. Although various weighting strategies have been designed for unimodal matching tasks, few have been applied to cross-modal tasks because of the specificity of such tasks. The few weighting strategies that are designed for cross-modal scenarios usually involve many hyper-parameters, which require substantial computational resources to tune. In this paper, we introduce a new exponential loss, which assigns appropriate weights to individual positive and negative pairs according to their similarity, thereby adaptively penalizing pairs of different hardness. Furthermore, the exponential loss has only two hyper-parameters, making it easier to find settings that suit various data distributions in practice. The exponential loss can be applied universally to well-established cross-modal models and further boost their retrieval performance. We exhaustively ablate our method on image-text matching, video-text matching, and unimodal image matching. Experimental results show that a standard model trained with the exponential loss achieves noticeable performance gains.
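The abstract does not state the loss in closed form, so the PyTorch sketch below is an illustration only, not the authors' exact formulation. It assumes L2-normalized embeddings (cosine similarities in [-1, 1]) and uses alpha and beta as hypothetical names for the two hyper-parameters; the key idea it demonstrates is exponential weighting, where hard positives (low similarity) and hard negatives (high similarity) receive exponentially larger penalties.

```python
# Illustrative sketch only -- NOT the paper's exact formulation.
# Assumes L2-normalized embeddings, so similarities lie in [-1, 1];
# alpha and beta are hypothetical names for the two hyper-parameters.
import torch

def exponential_loss(sim: torch.Tensor, alpha: float = 2.0, beta: float = 2.0) -> torch.Tensor:
    """sim: (B, B) image-text similarity matrix whose diagonal holds
    matched (positive) pairs and off-diagonal entries hold negatives."""
    B = sim.size(0)
    pos = sim.diag()                                   # similarities of matched pairs
    neg_mask = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    neg = sim[neg_mask].view(B, B - 1)                 # similarities of mismatched pairs
    # exp(-alpha * s) grows as a positive pair's similarity drops, so hard
    # positives get exponentially larger weight; exp(beta * s) grows as a
    # negative pair's similarity rises, so hard negatives are weighted up too.
    loss_pos = torch.exp(-alpha * pos).mean()
    loss_neg = torch.exp(beta * neg).mean()
    return loss_pos + loss_neg

# Usage: cosine similarities from a batch of paired image/text embeddings.
img = torch.nn.functional.normalize(torch.randn(32, 512), dim=1)
txt = torch.nn.functional.normalize(torch.randn(32, 512), dim=1)
loss = exponential_loss(img @ txt.t())
```

Because the weights are smooth functions of similarity, this kind of loss needs no explicit hard-example mining, and the two scalars alone control how sharply hard pairs are emphasized.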
Keywords
Exponential loss, video-text matching, image-text matching