Visual Commonsense-Aware Representation Network for Video Captioning

Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2023)

Abstract

Video captioning, that is, generating consecutive descriptions for videos, requires taking full advantage of the visual representation throughout the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods exploit only the superficial associations contained in a video itself, without considering the intrinsic visual commonsense knowledge that exists in a video dataset, which may limit their ability to reason over such knowledge and produce accurate descriptions. To address this problem, we propose a simple yet effective method, called visual commonsense-aware representation network (VCRN), for video captioning. Specifically, we construct a Video Dictionary, a plug-and-play component, by clustering all video features from the entire dataset into multiple cluster centers without additional annotation. Each center implicitly represents a visual commonsense concept in the video domain and is used by our proposed visual concept selection (VCS) component to obtain a video-related concept feature. Next, a concept-integrated generation (CIG) component is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method achieves state-of-the-art performance, indicating its effectiveness. In addition, our method is integrated into an existing video question answering (VideoQA) method and improves its performance, which further demonstrates the generalization capability of our method. The source code has been released at https://github.com/zchoi/VCRN.
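
The dictionary construction and concept selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm or the selection mechanism, so plain k-means and scaled dot-product attention are assumptions here, and the function names `build_video_dictionary` and `select_concepts`, the feature dimensions, and the random data are all hypothetical.

```python
import numpy as np

def build_video_dictionary(features, num_concepts, num_iters=20, seed=0):
    """Cluster (N, D) video features into num_concepts centers.

    Assumption: plain k-means; the abstract only says "clustering".
    """
    rng = np.random.default_rng(seed)
    # Initialise centers from distinct, randomly chosen feature vectors.
    idx = rng.choice(len(features), size=num_concepts, replace=False)
    centers = features[idx].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every feature to every center: (N, K).
        diffs = features[:, None, :] - centers[None, :, :]
        dists = (diffs ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_concepts):
            members = features[assign == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers

def select_concepts(video_feature, dictionary):
    """Return a concept feature (D,) and the weights (K,) over the centers.

    Assumption: soft selection via scaled dot-product attention.
    """
    scores = dictionary @ video_feature / np.sqrt(dictionary.shape[1])
    scores = scores - scores.max()  # numerical stability for the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ dictionary, weights

# Hypothetical usage with random stand-in features (200 videos, 16 dims).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))
dictionary = build_video_dictionary(features, num_concepts=8)
concept_feature, weights = select_concepts(features[0], dictionary)
```

Because the dictionary is built once from unlabeled features, it needs no extra annotation; the resulting concept feature would then be passed, together with the video features, to the caption decoder.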

Keywords

Visualization, Commonsense reasoning, Dictionaries, Task analysis, Knowledge based systems, Semantics, Decoding, Attention mechanism, language generation, video captioning, visual commonsense knowledge
