Bidirectional Retrieval Made Simple
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Abstract
This paper presents a simple yet effective character-level architecture for learning bidirectional retrieval models. Aligning multimodal content is particularly challenging because of the difficulty of finding semantic correspondence between images and descriptions. We introduce an efficient character-level inception module, designed to learn textual semantic embeddings by convolving raw characters at distinct granularity levels. Our approach is capable of explicitly encoding hierarchical information from distinct base-level representations (e.g., characters, words, and sentences) into a shared multimodal space, where it maps the semantic correspondence between images and descriptions via a contrastive pairwise loss function that minimizes order violations. Models generated by our approach are far more robust to input noise than state-of-the-art strategies based on word embeddings. Despite being conceptually much simpler and requiring fewer parameters, our models outperform the state-of-the-art approaches by 4.8% in the task of description retrieval and 2.7% in the task of image retrieval (absolute R@1 values) on the popular MS COCO retrieval dataset. We also show that our models present solid performance for text classification, especially in multilingual and noisy domains.
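To make the loss described in the abstract concrete, here is a minimal sketch of a contrastive pairwise loss built on an order-violation penalty. It is not the authors' implementation: the penalty form ||max(0, t - v)||^2, the non-negative embeddings, the margin value, and the function names `order_violation` and `contrastive_pairwise_loss` are all assumptions made for illustration.

```python
import numpy as np

def order_violation(img, txt):
    """Assumed penalty: E[i, j] = ||max(0, txt[j] - img[i])||^2.

    img has shape (n, d) and txt has shape (m, d); both are taken to be
    non-negative embeddings in the shared multimodal space.
    """
    diff = txt[None, :, :] - img[:, None, :]          # (n, m, d)
    return np.sum(np.maximum(diff, 0.0) ** 2, axis=2)  # (n, m)

def contrastive_pairwise_loss(img, txt, margin=0.05):
    """Hinge loss over mismatched pairs; row i of img matches row i of txt.

    The margin is an illustrative value, not one taken from the abstract.
    """
    scores = -order_violation(img, txt)   # higher score = better match
    pos = np.diag(scores)                 # scores of the matched pairs
    # For each image, every non-matching description is a negative.
    cost_txt = np.maximum(0.0, margin - pos[:, None] + scores)
    # For each description, every non-matching image is a negative.
    cost_img = np.maximum(0.0, margin - pos[None, :] + scores)
    mask = 1.0 - np.eye(len(img))         # ignore the matched pairs
    return float(np.sum((cost_txt + cost_img) * mask))

# Toy batch: three image embeddings and their matching text embeddings.
images = np.eye(3)
aligned = contrastive_pairwise_loss(images, np.eye(3))
shuffled = contrastive_pairwise_loss(images, np.eye(3)[[1, 0, 2]])
```

In this toy batch the aligned pairs violate no ordering and sit beyond the margin, so the loss vanishes, while swapping two descriptions makes the loss positive. Using both the image-to-text and text-to-image terms reflects the bidirectional retrieval setting.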
Keywords
contrastive pairwise loss function,word-embeddings,image retrieval,bidirectional retrieval models,multimodal content,semantic correspondence,textual semantic embeddings,raw characters,distinct granularity levels,hierarchical information,shared multimodal space,character-level architecture,MS COCO retrieval dataset,order-violation minimization,character-level inception module,base-level representations,text classification