Exploring Pairwise Relationships Adaptively From Linguistic Context in Image Captioning

Zongjian Zhang, Qiang Wu, Yang Wang, Fang Chen

IEEE Transactions on Multimedia (2022)



Abstract

For image captioning, recent works have started to explore visual relationships in order to generate high-quality interactive words (i.e., verbs and prepositions). However, many existing works focus only on the semantic level, analysing feature similarity between objects in the visual domain while ignoring the linguistic context available in the caption decoder. During captioning, entity words can be inferred from the visual information of objects. The interactive words that represent relationships between entity words, however, can only be inferred from the high-level language meaning generated during caption decoding. This high-level language meaning is called linguistic context, which refers to the relational context between words or phrases in the caption sentences. The linguistic context can serve as strong guidance for effectively exploring related visual relationships between different objects. To achieve this, we propose a novel context-adaptive attention module that is strongly driven by the linguistic context from the caption decoder. In this module, a novel visual relationship attention is designed based on a bilinear self-attention model to explore related visual relationships and encode more discriminative features under the linguistic context. To adaptively attend to related visual relationships when generating interactive words, or to related visual objects when generating entity words, an attention modulator is integrated as an attention channel controller that responds dynamically to the changing linguistic context of the caption decoder. In experiments on the MSCOCO dataset, our model achieves promising performance compared with all counterpart models that explore visual relationships.
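
The abstract gives no equations, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes that object features and the decoder's linguistic context are plain vectors of the same length, that a pairwise relationship feature is the element-wise product of two object vectors (the simplest bilinear form, with no learned projections), and that the attention modulator is a single sigmoid gate. All function names, the gate parameters, and the example numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Convex combination of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def object_attention(objects, context):
    """Attend over single objects, scored against the linguistic context."""
    scores = [dot(obj, context) for obj in objects]
    return weighted_sum(softmax(scores), objects)


def relationship_attention(objects, context):
    """Attend over ordered object pairs.

    Assumption: the pair feature is the element-wise product a * b, so the
    score sum_k c_k * a_k * b_k is a bilinear form in (a, b) whose weights
    are set by the context vector c.
    """
    n = len(objects)
    pairs = [[a * b for a, b in zip(objects[i], objects[j])]
             for i in range(n) for j in range(n) if i != j]
    scores = [dot(pair, context) for pair in pairs]
    return weighted_sum(softmax(scores), pairs)


def context_adaptive_attention(objects, context, gate_weights, gate_bias=0.0):
    """Blend the two attention channels with a context-driven gate.

    Assumption: the modulator is a scalar sigmoid gate; a gate near 1
    favours relationship features, a gate near 0 favours object features.
    """
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, context) + gate_bias)))
    obj_feat = object_attention(objects, context)
    rel_feat = relationship_attention(objects, context)
    mixed = [gate * r + (1.0 - gate) * o for r, o in zip(rel_feat, obj_feat)]
    return mixed, gate


if __name__ == "__main__":
    objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy object features
    context = [0.5, -0.5]                           # toy decoder state
    feature, gate = context_adaptive_attention(objects, context, [4.0, 0.0])
    print(feature, gate)
```

In this sketch the gate depends only on the context vector, which mirrors the abstract's description of a controller that switches between relationship and object attention as the decoder state changes; a trained model would learn the gate and any projection matrices rather than fix them by hand.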


Keywords

Visualization, Linguistics, Decoding, Modulation, Context modeling, Adaptation models, Semantics, Bilinear attention, bilinear self-attention, context-adaptive attention, dynamic linguistic context, image captioning, visual relationship attention
