RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)

Citations: 199 | Views: 298
Abstract
Recent progress on visual question answering has explored the merits of grid features for vision-language tasks. Meanwhile, transformer-based models have shown remarkable performance in various sequence prediction problems. However, the spatial information loss of grid features caused by the flattening operation, as well as the transformer model's difficulty in distinguishing visual words from non-visual words, are still left unexplored. In this paper, we first propose a Grid-Augmented (GA) module, in which relative geometry features between grids are incorporated to enhance visual representations. Then, we build a BERT-based language model to extract language context and propose an Adaptive-Attention (AA) module on top of a transformer decoder to adaptively measure the contribution of visual and language cues before making decisions for word prediction. To prove the generality of our proposals, we apply the two modules to the vanilla transformer model to build our Relationship-Sensitive Transformer (RSTNet) for the image captioning task. The proposed model is tested on the MSCOCO benchmark, where it achieves new state-of-the-art results on both the Karpathy test split and the online test server. Source code is available at GitHub (1).
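
As described in the abstract, the AA module weighs visual cues against language context at each decoding step before predicting a word. The following PyTorch sketch illustrates one way such an adaptive-attention gate could look; the class name, layer names, dimensions, and the additive scoring form are assumptions made for illustration only, not the paper's implementation (the official code is in the GitHub repository referenced above).

import torch
import torch.nn as nn

class AdaptiveAttentionGate(nn.Module):
    # Hypothetical sketch: fuse attended visual features with BERT-style
    # language context via a learned soft weight per decoding position.
    # Names and scoring form are assumptions, not the official RSTNet code.
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.visual_proj = nn.Linear(d_model, d_model)
        self.language_proj = nn.Linear(d_model, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, hidden, visual, language):
        # hidden, visual, language: (batch, seq_len, d_model)
        q = self.query_proj(hidden)
        s_vis = self.score(torch.tanh(q + self.visual_proj(visual)))      # (B, T, 1)
        s_lan = self.score(torch.tanh(q + self.language_proj(language)))  # (B, T, 1)
        w = torch.softmax(torch.cat([s_vis, s_lan], dim=-1), dim=-1)      # (B, T, 2)
        fused = w[..., :1] * visual + w[..., 1:] * language               # (B, T, d_model)
        return fused, w

# Toy usage with random tensors standing in for decoder hidden states,
# attended grid (visual) features, and language context.
gate = AdaptiveAttentionGate(d_model=512)
h = torch.randn(2, 10, 512)
v = torch.randn(2, 10, 512)
c = torch.randn(2, 10, 512)
fused, w = gate(h, v, c)
print(fused.shape, w.shape)  # torch.Size([2, 10, 512]) torch.Size([2, 10, 2])

In this sketch, the first channel of the returned weights can be read as a per-word "visualness" score, which mirrors the abstract's goal of measuring how much visual versus language information drives each predicted word.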
Keywords
RSTNet, non-visual words, visual question answering, grid features, vision-language tasks, transformer-based models, sequence prediction problems, spatial information loss, geometry features, visual representations, BERT-based language model, language context, transformer decoder, visual cues, word prediction, vanilla transformer model, image captioning task, relationship-sensitive transformer, adaptive-attention module, grid-augmented module, MSCOCO benchmark, Karpathy test split, language cues