EViLBERT: Learning Task-Agnostic Multimodal Sense Embeddings

IJCAI 2020 (2020)

Cited by 10 | Views 158
Abstract
The problem of grounding language in vision is increasingly attracting scholarly efforts. As of now, however, most of the approaches have been limited to word embeddings, which are not capable of handling polysemous words. This is mainly due to the limited coverage of the available semantically-annotated datasets, hence forcing research to rely on alternative technologies (i.e., image search engines). To address this issue, we introduce EViLBERT, an approach which is able to perform image classification over an open set of concepts, both concrete and non-concrete. Our approach is based on the recently introduced Vision-Language Pretraining (VLP) model, and builds upon a manually-annotated dataset of concept-image pairs. We use our technique to clean up the image-to-concept mapping that is provided within a multilingual knowledge base, resulting in over 258,000 images associated with 42,500 concepts. We show that our VLP-based model can be used to create multimodal sense embeddings starting from our automatically-created dataset. In turn, we also show that these multimodal embeddings improve the performance of a Word Sense Disambiguation architecture over a strong unimodal baseline. We release code, dataset and embeddings at http://babelpic.org.
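To illustrate how such multimodal sense embeddings could be combined with a unimodal baseline for Word Sense Disambiguation, here is a minimal sketch in Python. It assumes the released embeddings come as word2vec-style text files keyed by sense identifiers; the file names, dimensions, and the simple concatenation-based fusion are hypothetical illustrations, not the paper's exact pipeline.

```python
# Minimal sketch: fusing a unimodal sense embedding with an EViLBERT-style
# multimodal sense embedding by concatenation, producing a vector that a
# WSD classifier could consume. File names and the fusion strategy are
# assumptions for illustration only.
import numpy as np

def load_embeddings(path):
    """Load a word2vec-style text file: one '<sense_id> v1 v2 ...' per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Hypothetical file names for the two embedding spaces.
unimodal = load_embeddings("unimodal_senses.txt")
multimodal = load_embeddings("evilbert_senses.txt")

def fuse(sense_id):
    """Concatenate both views; back off to zeros when one view is missing."""
    u = unimodal.get(sense_id)
    m = multimodal.get(sense_id)
    if u is None and m is None:
        return None
    if u is None:
        u = np.zeros_like(next(iter(unimodal.values())))
    if m is None:
        m = np.zeros_like(next(iter(multimodal.values())))
    return np.concatenate([u, m])

# Example lookup with a BabelNet-style sense id (identifier chosen arbitrarily).
vec = fuse("bn:00008364n")
```

In practice the fusion strategy (concatenation, gating, or projection into a shared space) and the fallback for senses lacking an image-based vector are design choices the paper evaluates against its unimodal baseline.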