Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Many real-world video-text tasks involve different levels of granularity to represent local and global information with distinct semantics, such as frames and words, clips and sentences, or videos and paragraphs. Most existing multimodal representation learning methods suffer from two limitations: (i) adopting expert systems or manual design to extract more fine-grained local information (such as objects and actions in a video frame) for supervision may lead to information asymmetry, since there may be no corresponding information in the other modalities; (ii) neglecting the hierarchical nature of the data when aggregating different levels of information from different modalities leads to insufficient representations. To alleviate these issues, we propose a Multi-Granularity Aggregation Transformer (MGAT) for joint video-audio-text representation learning. Specifically, for intra-modality learning, we first design a multi-granularity transformer module that relieves information asymmetry by making full use of local and global information within a single modality from different perspectives. Then, for inter-modality learning, we develop an attention-guided aggregation module to fuse audio and video information hierarchically. Finally, we align the aggregated information with text information at different hierarchical levels via intra- and inter-modality consistency losses and a contrastive loss. By exploiting information at multiple granularities, we obtain a well-performing representation model for a variety of tasks, e.g., video-paragraph retrieval and video captioning. Extensive experiments on two challenging benchmarks, i.e., ActivityNet Captions and YouCook2, demonstrate the superiority of our proposed method.
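The alignment step can be illustrated with a minimal sketch. The abstract does not give the loss formulas, so the code below assumes a standard symmetric InfoNCE-style contrastive loss applied independently at each granularity level and summed; the function names, the temperature value, and the equal weighting of levels are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(av_emb, text_emb, temperature=0.07):
    """Assumed InfoNCE-style loss: row i of av_emb is paired with row i of text_emb.

    av_emb:   (batch, dim) aggregated audio-video embeddings (hypothetical input)
    text_emb: (batch, dim) text embeddings at the same granularity level
    """
    av = l2_normalize(av_emb)
    tx = l2_normalize(text_emb)
    logits = av @ tx.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])          # matched pairs lie on the diagonal
    loss_av_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_av = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_av_to_text + loss_text_to_av)

def multi_level_loss(levels, temperature=0.07):
    """Sum the loss over granularity levels, e.g. clip-sentence and video-paragraph.

    levels: list of (av_emb, text_emb) pairs, one per level (equal weights assumed).
    """
    return sum(symmetric_contrastive_loss(a, t, temperature) for a, t in levels)
```

Under these assumptions, correctly paired embeddings produce a lower loss than mismatched ones, which is the property that retrieval between videos and paragraphs relies on.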
Keywords
Multimodal representation learning, attention mechanism, multi-granularity aggregation, video-paragraph retrieval, video captioning