Probing Text Models for Common Ground with Visual Representations

arXiv (2020)

Abstract
Vision, as a central component of human perception, plays a fundamental role in shaping natural language. To better understand how text models are connected to our visual perceptions, we propose a method for examining the similarities between neural representations extracted from words in text and objects in images. Our approach uses a lightweight probing model that learns to map language representations of concrete words to the visual domain. We find that representations from models trained on purely textual data, such as BERT, can be nontrivially mapped to those of a vision model. Such mappings generalize to object categories that were never seen by the probe during training, unlike mappings learned from permuted or random representations. Moreover, we find that the context surrounding objects in sentences greatly impacts performance. Finally, we show that humans significantly outperform all examined models, suggesting considerable room for improvement in representation learning and grounding.
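As a rough illustration of the probing setup described in the abstract, the sketch below fits a ridge-regression probe that maps text-side features into a visual embedding space and evaluates it with nearest-neighbour retrieval on held-out items. The synthetic data, feature dimensions, regularizer, and evaluation metric are illustrative assumptions, not the paper's actual features, probe architecture, or hyperparameters.

```python
# Minimal sketch of a linear probe from text to visual embeddings (assumed setup).
# Synthetic random features stand in for real BERT word embeddings and
# vision-model object embeddings.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 768-d text features, 512-d visual features.
n_train, n_test = 800, 200
d_text, d_img = 768, 512

# Synthetic paired data with a hidden linear relation plus noise.
X_train = rng.normal(size=(n_train, d_text))
W_true = rng.normal(size=(d_text, d_img)) / np.sqrt(d_text)
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, d_img))
X_test = rng.normal(size=(n_test, d_text))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, d_img))

# Ridge-regression probe: closed-form solution mapping text space to visual space.
lam = 1.0
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d_text),
                    X_train.T @ Y_train)

# Retrieval-style evaluation on held-out items: for each predicted visual
# vector, check whether its nearest neighbour (by cosine similarity) among
# the true test vectors is the matching one (top-1 accuracy).
pred = X_test @ W
pred_n = pred / np.linalg.norm(pred, axis=1, keepdims=True)
true_n = Y_test / np.linalg.norm(Y_test, axis=1, keepdims=True)
top1 = (np.argmax(pred_n @ true_n.T, axis=1) == np.arange(n_test)).mean()
print(f"top-1 retrieval accuracy on held-out items: {top1:.3f}")
```

In a real run, the held-out items would come from object categories unseen during probe training, which is what distinguishes a genuine text-to-vision mapping from one learned on permuted or random representations.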
Keywords
probing text models, visual representations