Audiovisual SlowFast Networks for Video Recognition

arXiv (2020)

Abstract
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast extends SlowFast Networks with a Faster Audio pathway that is deeply integrated with its visual counterparts. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we employ DropPathway, which randomly drops the Audio pathway during training, as a simple and effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization and show that it leads to better audiovisual features. We report state-of-the-art results on four video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to self-supervised tasks, where it improves over prior work. Code will be made available at: https://github.com/facebookresearch/SlowFast.
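
The DropPathway idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name fuse_with_drop_pathway, the drop_prob parameter, the flat-list feature representation, and fusion by element-wise sum are all assumptions made for illustration. The abstract states only that the Audio pathway is randomly dropped during training; the sketch applies that at a single fusion point.

```python
import random


def fuse_with_drop_pathway(visual, audio, drop_prob, training, rng=random):
    """Fuse audio into visual features, randomly skipping audio while training.

    Assumptions (not taken from the abstract): features are flat lists of
    floats of equal length, and fusion is a plain element-wise sum.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    if len(visual) != len(audio):
        raise ValueError("visual and audio features must have the same length")

    # During training, drop the whole Audio pathway with probability drop_prob,
    # so the visual features pass through this fusion point unchanged.
    if training and rng.random() < drop_prob:
        return list(visual)

    # Otherwise (including at inference time) audio always contributes.
    return [v + a for v, a in zip(visual, audio)]


if __name__ == "__main__":
    visual = [1.0, 2.0]
    audio = [0.5, 0.5]
    # Training with a hypothetical drop probability of 0.8.
    print(fuse_with_drop_pathway(visual, audio, drop_prob=0.8, training=True))
    # Inference: audio is never dropped.
    print(fuse_with_drop_pathway(visual, audio, drop_prob=0.8, training=False))
```

In this sketch, dropping is disabled at inference so that both modalities are always used when making predictions; the drop probability of 0.8 is a placeholder, not a value reported in the abstract.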