Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization

ICLR (2021)

Cited: 49 | Views: 107
Abstract
Temporally localizing actions in videos is one of the key components of video understanding. Learning from weakly-labeled data is seen as a potential solution for avoiding expensive frame-level annotations. Different from other works, which depend only on the visual modality, we propose to learn richer audio-visual representations for weakly-supervised action localization. First, we propose a multi-stage cross-attention mechanism to collaboratively fuse audio and visual features, which preserves the intra-modal characteristics. Second, to model both foreground and background frames, we construct an open-max classifier, which treats the background class as an open set. Third, for precise action localization, we design consistency losses that enforce temporal continuity of the action-class predictions and also improve foreground-prediction reliability. Extensive experiments on two publicly available video datasets (AVE and ActivityNet 1.2) show that the proposed method effectively fuses the audio and visual modalities and achieves state-of-the-art results for weakly-supervised action localization.
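To make the fusion idea concrete, below is a minimal PyTorch sketch of what one cross-attention stage between snippet-level audio and visual features might look like, with a residual connection so each modality's intra-modal characteristics are preserved, plus a simple temporal-consistency penalty. The module name, feature dimensions, and loss form are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Sketch of one audio-visual cross-attention fusion stage (assumed design,
# not the paper's exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionStage(nn.Module):
    """Each modality attends to the other; a residual connection keeps the
    original (intra-modal) features intact."""
    def __init__(self, dim=256):
        super().__init__()
        # Projections for audio-queries-over-visual and visual-queries-over-audio.
        self.q_a, self.k_v, self.v_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q_v, self.k_a, self.v_a = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim) snippet-level features.
        attn_av = F.softmax(self.q_a(audio) @ self.k_v(visual).transpose(1, 2) * self.scale, dim=-1)
        audio_out = audio + attn_av @ self.v_v(visual)    # audio enriched by visual context
        attn_va = F.softmax(self.q_v(visual) @ self.k_a(audio).transpose(1, 2) * self.scale, dim=-1)
        visual_out = visual + attn_va @ self.v_a(audio)   # visual enriched by audio context
        return audio_out, visual_out

def temporal_consistency_loss(scores):
    # scores: (B, T, num_classes) per-snippet class predictions.
    # Penalizes abrupt frame-to-frame changes in the predicted action scores.
    return (scores[:, 1:] - scores[:, :-1]).abs().mean()
```

Stacking several such stages would give the "multi-stage" fusion described in the abstract; the consistency term above is one simple way to encourage temporally smooth action-class predictions.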