Listen and Look: Multi-Modal Aggregation and Co-Attention Network for Video-Audio Retrieval

IEEE International Conference on Multimedia and Expo (ICME), 2022

Abstract
Video is a natural source of multi-modal data with intrinsic correlations between different modalities, such as objects, motions, and captions. Though intuitive, such inherent supervision has not been well explored in previous video-audio retrieval work. Moreover, existing methods process the video stream and the audio stream separately, ignoring the mutual interactions between them. In this paper, we propose a two-stream model named Multi-modal Aggregation and Co-attention network (MAC), which processes video and audio inputs with co-attentional interactions. Specifically, our method takes raw videos as input and extracts aggregated features from multiple modalities to benefit video representation learning. Then, we introduce a self-attention mechanism so that videos adaptively assign higher weights to the more representative modalities. Furthermore, we introduce a co-attention transformer module to better capture the relations between videos and audios. By exchanging key-value pairs in multi-headed attention, this module enables video-attended audio features to be incorporated into video representations and vice versa. Experiments show that our method significantly outperforms other state-of-the-art methods.
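To make the key-value exchange concrete, below is a minimal PyTorch sketch of a co-attention block in the spirit the abstract describes: each stream's queries attend over the other stream's keys and values. The dimensions, layer choices (`nn.MultiheadAttention`, residual connections, LayerNorm), and names are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Sketch of a co-attention block: the two streams swap key-value
    pairs so that video attends to audio and audio attends to video.
    All hyperparameters here are assumptions for illustration."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.v2a_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video, audio):
        # video: (B, Tv, dim) aggregated multi-modal video features
        # audio: (B, Ta, dim) audio features
        # Audio-attended video features: video queries, audio keys/values.
        v_out, _ = self.v2a_attn(query=video, key=audio, value=audio)
        # Video-attended audio features: audio queries, video keys/values.
        a_out, _ = self.a2v_attn(query=audio, key=video, value=video)
        # Residual connection plus normalization, a common transformer pattern.
        video = self.norm_v(video + v_out)
        audio = self.norm_a(audio + a_out)
        return video, audio

# Example usage with random features of hypothetical shapes.
block = CoAttentionBlock(dim=512, num_heads=8)
video_feats = torch.randn(2, 16, 512)  # 2 clips, 16 video tokens each
audio_feats = torch.randn(2, 32, 512)  # 2 clips, 32 audio tokens each
v, a = block(video_feats, audio_feats)
```

The design point the sketch captures is symmetry: the same multi-headed attention machinery is used in both directions, so each representation is conditioned on the other modality before retrieval matching.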
Keywords
aggregation, multi-modal, co-attention, video-audio