Frame Augmented Alternating Attention Network for Video Question Answering

IEEE Transactions on Multimedia (2020)

Abstract
Vision and language understanding is one of the most fundamental and challenging problems in multimedia intelligence. Simultaneously understanding video actions and a related natural-language question, and further producing an accurate answer, is even more challenging, since it requires jointly modeling information across modalities. In the past few years, some studies have begun to attack this problem with attention-enhanced deep neural networks. However, simple attention mechanisms such as unidirectional attention fail to yield a good mapping between the two modalities. Moreover, none of these Video QA models explores high-level semantics at an augmented video-frame level. In this paper, we augment each frame representation with its context information through a novel feature extractor that combines the advantages of ResNet and a variant of C3D. In addition, we propose a novel alternating attention network that can alternately attend to frame regions, video frames, and question words over multiple turns. This yields better joint representations of the video and the question, further helping the model discover deeper relationships between the two modalities. Our method outperforms state-of-the-art Video QA models on two existing video question answering datasets, and ablation studies show that our feature extractor and the alternating attention mechanism jointly improve performance.
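The implementation details are not reproduced on this page, but the following minimal PyTorch sketch illustrates the two ideas the abstract names: augmenting per-frame appearance features with clip-level context, and attention that alternates over frame regions, video frames, and question words across multiple turns. All module names, dimensions, the concatenation-based augmentation, and the number of turns are illustrative assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def augment_frames(appearance: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Augment per-frame appearance features (e.g. ResNet) with clip-level
    context features (e.g. a C3D variant) by concatenation -- one plausible
    reading of the abstract's frame augmentation, assumed here.
    appearance: (B, T, Da); context: (B, T, Dc) -> (B, T, Da + Dc)."""
    return torch.cat([appearance, context], dim=-1)


class AlternatingAttention(nn.Module):
    """Alternately attends to frame regions, video frames, and question
    words over several turns, each attended summary guiding the next step."""

    def __init__(self, dim: int = 512, turns: int = 2):
        super().__init__()
        self.turns = turns
        # One simple additive-style scorer per modality (assumed design).
        self.region_att = nn.Linear(2 * dim, 1)
        self.frame_att = nn.Linear(2 * dim, 1)
        self.word_att = nn.Linear(2 * dim, 1)

    @staticmethod
    def _attend(scorer: nn.Linear, feats: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D); guide: (B, D) -> weighted summary (B, D)
        guide = guide.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = scorer(torch.cat([feats, guide], dim=-1)).squeeze(-1)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)

    def forward(self, regions: torch.Tensor, frames: torch.Tensor,
                words: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, D); frames: (B, T, D); words: (B, L, D)
        guide = words.mean(dim=1)  # initial question summary
        for _ in range(self.turns):
            r = self._attend(self.region_att, regions, guide)  # attend regions
            f = self._attend(self.frame_att, frames, r)        # then frames
            guide = self._attend(self.word_att, words, f)      # then words
        return torch.cat([r, f, guide], dim=-1)  # joint representation


# Usage with random inputs; shapes are arbitrary for demonstration.
B, R, T, L, D = 2, 36, 20, 12, 512
model = AlternatingAttention(dim=D, turns=2)
joint = model(torch.randn(B, R, D), torch.randn(B, T, D), torch.randn(B, L, D))
print(joint.shape)  # torch.Size([2, 1536])
```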
Keywords
Feature extraction, Visualization, Knowledge discovery, Task analysis, Data mining, Neural networks, Semantics