A Video Captioning Method Based on Visual-Text Semantic Association

Yan Fu, Xinli Wei

2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP), 2023

Abstract
At present, video captioning methods based on encoder-decoder frameworks often rely too heavily on information from a single visual modality, which makes it difficult for the model to understand video content accurately. To address this problem, this paper proposes a video captioning method based on visual-text semantic association (VC-VTSA) from the perspective of multimodal association. In the encoding stage, the method extracts 2D static features, 3D motion features, and object-level regional features of the video and integrates them into global visual features. In the semantic association stage, the generated words are combined into phrases with contextual semantic dependencies using a self-attention mechanism, and these phrases are associated with the visual features extracted in the encoding stage to form a bi-modal semantic region containing both visual content and textual information. By exploiting the latent complementary associations between the different modalities in the semantic region, the video content is characterized more fully. In addition, a visual noise filtering strategy (VNFS) is designed to ensure that the lexical phrases in the semantic region are accurately associated with the corresponding visual content. Finally, the constructed semantic regions are fed into an LSTM decoder to predict the next word, repeating until the complete video caption is generated.
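The abstract outlines the data flow but gives no implementation details. Below is a minimal PyTorch sketch of one plausible decoding step: the three feature streams are projected into a shared space and fused, self-attention over previously generated words forms a phrase-level context, cross-modal attention associates that context with the fused visual features, and a sigmoid gate stands in for the VNFS before an LSTM cell predicts the next word. All module names, dimensions, and the gating-based reading of VNFS are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of a VC-VTSA decoding step; dimensions and the
# gate-based noise filter are assumptions, not the published design.
import torch
import torch.nn as nn

class VCVTSADecoderStep(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        # Project 2D static, 3D motion, and object-level features into a
        # shared space so they can be fused into global visual features.
        self.proj_2d = nn.Linear(2048, d_model)
        self.proj_3d = nn.Linear(1024, d_model)
        self.proj_obj = nn.Linear(2048, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        # Self-attention combines generated words into phrase-level context.
        self.phrase_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Cross-modal attention associates phrases with visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Assumed stand-in for the visual noise filtering strategy: a gate
        # that suppresses visual content weakly related to the phrase query.
        self.noise_gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.lstm = nn.LSTMCell(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, f2d, f3d, fobj, prev_words, state):
        # Fuse the three feature streams into one global visual sequence.
        visual = torch.cat([self.proj_2d(f2d), self.proj_3d(f3d),
                            self.proj_obj(fobj)], dim=1)        # (B, N, d)
        words = self.embed(prev_words)                          # (B, T, d)
        phrases, _ = self.phrase_attn(words, words, words)      # (B, T, d)
        query = phrases[:, -1:, :]                # latest phrase context
        attended, _ = self.cross_attn(query, visual, visual)    # (B, 1, d)
        # Gate the attended visual content against the phrase context.
        gate = self.noise_gate(torch.cat([attended, query], dim=-1))
        region = torch.cat([gate * attended, query], dim=-1).squeeze(1)
        h, c = self.lstm(region, state)           # next-word prediction
        return self.out(h), (h, c)

# Toy usage with random features: 26 frames of 2D/3D features,
# 10 object regions, 5 previously generated word ids.
B = 2
model = VCVTSADecoderStep()
state = (torch.zeros(B, 512), torch.zeros(B, 512))
logits, state = model(torch.randn(B, 26, 2048), torch.randn(B, 26, 1024),
                      torch.randn(B, 10, 2048),
                      torch.randint(0, 10000, (B, 5)), state)
print(logits.shape)  # torch.Size([2, 10000])
```

In an actual captioning loop this step would be applied autoregressively, appending each predicted word to prev_words until an end-of-sentence token is produced.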
Keywords
video captioning, semantic region, visual noise filtering strategy