Semi-supervised Visual Feature Integration for Language Models through Sentence Visualization

Multimodal Interfaces and Machine Learning for Multimodal Interaction (2021)

Abstract
Integrating visual features has proved useful for natural language understanding tasks. Nevertheless, most existing multimodal language models rely heavily on training with aligned image-text data. In this paper, we propose a novel semi-supervised visual integration framework for pre-trained language models, in which visual features are obtained through a sentence visualization and vision-language fusion mechanism. The framework is unique in two respects: 1) the integration is conducted in a semi-supervised manner and does not require images aligned with the processed sentences; 2) the framework works as an auxiliary component and does not affect the language processing ability of the underlying language model. Experimental results on both natural language inference and reading comprehension tasks demonstrate that our framework improves over strong baseline language models. Since our framework requires only an image database, rather than images aligned with the processed texts, it offers a feasible path toward multimodal language learning.
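To make the described mechanism concrete, below is a minimal PyTorch sketch of one way such a retrieve-then-fuse pipeline could be wired up: a sentence embedding queries a fixed image-feature database (standing in for the "sentence visualization" step), and the retrieved features are gated into the sentence representation through an auxiliary residual head, leaving the language model itself untouched. All names, dimensions, and design choices here (SentenceVisualizer, VisualFusion, top-k mean pooling, sigmoid gating) are our own assumptions for illustration, not the authors' implementation.

    # Hypothetical sketch of a retrieve-then-fuse visual integration module.
    # Class names and design details are assumptions, not the paper's code.
    import torch
    import torch.nn as nn

    class SentenceVisualizer(nn.Module):
        """Stand-in for 'sentence visualization': retrieve image features
        for a sentence from a precomputed image-feature database."""
        def __init__(self, image_feats: torch.Tensor, text_dim: int, img_dim: int):
            super().__init__()
            # (num_images, img_dim) features, precomputed once from the database
            self.register_buffer("image_feats", image_feats)
            self.proj = nn.Linear(text_dim, img_dim)  # map text into image space

        def forward(self, sent_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
            query = self.proj(sent_emb)               # (batch, img_dim)
            sims = query @ self.image_feats.T         # (batch, num_images)
            topk = sims.topk(k, dim=-1).indices       # (batch, k) nearest images
            return self.image_feats[topk].mean(dim=1) # (batch, img_dim) pooled

    class VisualFusion(nn.Module):
        """Auxiliary fusion head: gate retrieved visual features into the
        (frozen) language model's sentence representation."""
        def __init__(self, text_dim: int, img_dim: int):
            super().__init__()
            self.visual_proj = nn.Linear(img_dim, text_dim)
            self.gate = nn.Linear(2 * text_dim, text_dim)

        def forward(self, sent_emb: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
            v = self.visual_proj(visual_feat)
            g = torch.sigmoid(self.gate(torch.cat([sent_emb, v], dim=-1)))
            return sent_emb + g * v  # residual keeps the LM output usable as-is

    # Usage with dummy tensors (dimensions are arbitrary assumptions):
    text_dim, img_dim = 768, 512
    db = torch.randn(1000, img_dim)          # precomputed image features
    visualizer = SentenceVisualizer(db, text_dim, img_dim)
    fusion = VisualFusion(text_dim, img_dim)
    sent = torch.randn(4, text_dim)          # sentence embeddings from a frozen LM
    out = fusion(sent, visualizer(sent))     # (4, 768), same shape as the LM output

Note the residual gating: when the gate saturates near zero, the module reduces to the original language model output, which is consistent with the abstract's claim that the component is auxiliary and does not affect the language model's processing ability.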