Listen and Look: Multi-Modal Aggregation and Co-Attention Network for Video-Audio Retrieval

IEEE International Conference on Multimedia and Expo (ICME), 2022

Abstract
Video is a natural source of multi-modal data with intrinsic correlations between different modalities, such as objects, motions, and captions. Though intuitive, such inherent supervision has not been well explored in previous video-audio retrieval work. Moreover, existing methods process the video stream and the audio stream separately, ignoring the mutual interactions between them. In this paper, we propose a two-stream model named Multi-modal Aggregation and Co-attention network (MAC), which processes video and audio inputs with co-attentional interactions. Specifically, our method takes raw videos as input and extracts aggregated features from multiple modalities to benefit video representation learning. Then, we introduce a self-attention mechanism so that videos adaptively assign higher weights to the more representative modalities. Furthermore, we introduce a co-attention transformer module to better capture the relations between videos and audios. By exchanging key-value pairs in multi-headed attention, this module enables video-attended audio features to be incorporated into video representations and vice versa. Experiments show that our method significantly outperforms other state-of-the-art methods.
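To make the key-value exchange concrete, below is a minimal PyTorch sketch of a co-attention block in the spirit the abstract describes: each stream's queries attend over the other stream's keys and values. The dimensions, layer choices (`nn.MultiheadAttention`, residual connections, LayerNorm), and names are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Sketch of a co-attention block: the two streams swap key-value
    pairs so that video attends to audio and audio attends to video.
    All hyperparameters here are assumptions for illustration."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video, audio):
        # video: (B, Tv, dim) aggregated multi-modal video features
        # audio: (B, Ta, dim) audio features
        # Audio-attended video features: video queries, audio keys/values.
        v_out, _ = self.v2a_attn(query=video, key=audio, value=audio)
        # Video-attended audio features: audio queries, video keys/values.
        a_out, _ = self.a2v_attn(query=audio, key=video, value=video)
        # Residual connection plus normalization, a common transformer pattern.
        video = self.norm_v(video + v_out)
        audio = self.norm_a(audio + a_out)
        return video, audio

# Example usage with random features of hypothetical shapes.
block = CoAttentionBlock(dim=512, num_heads=8)
video_feats = torch.randn(2, 16, 512)  # 2 clips, 16 video tokens each
audio_feats = torch.randn(2, 32, 512)  # 2 clips, 32 audio tokens each
v, a = block(video_feats, audio_feats)
```

The design point the sketch captures is symmetry: the same multi-headed attention machinery is used in both directions, so each representation is conditioned on the other modality before retrieval matching.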
Keywords
aggregation, multi-modal, co-attention, video-audio