Video Moment Retrieval via Comprehensive Relation-Aware Network

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Video moment retrieval aims to retrieve, from an untrimmed video, a target moment that semantically corresponds to a given language query. Existing methods commonly treat it as a regression task or a ranking task from the perspective of computer vision. Most of these works neglect the comprehensive relations between video content and language context at multiple granularities and fail to efficiently model temporal relations among different video moments. In this paper, we reformulate video moment retrieval as video reading comprehension by treating the input video as a text passage and the language query as a question. To tackle the above impediments, we propose a Comprehensive Relation-aware Network (CRNet) that perceives comprehensive relations from extensive aspects. Specifically, we fuse visual and textual features at both the clip level and the moment level to thoroughly exploit inter-modality information, yielding a coarse-and-fine cross-modal interaction. Moreover, a background suppression module is introduced to restrain irrelevant background clips, while a novel IoU attention mechanism and a graph attention layer are devised to capture the dependencies among highly correlated video moments for selecting the best candidate. In-depth experiments on three public datasets, TACoS, ActivityNet Captions, and Charades-STA, demonstrate the superiority of our solution.
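To make the IoU attention idea concrete, the following is a minimal sketch (not the paper's actual architecture; the function names and the softmax-over-IoU weighting are illustrative assumptions): pairwise temporal IoU between candidate moments is turned into attention weights, so moments with high temporal overlap exchange more information.

```python
import numpy as np

def temporal_iou(a, b):
    """Temporal IoU between two moments given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def iou_attention(moments, feats):
    """Hypothetical IoU attention: re-weight each moment's feature by its
    temporal overlap with the other candidates.

    moments: list of (start, end) tuples; feats: (N, D) feature array.
    A row-wise softmax over the pairwise IoU matrix gives attention
    weights, so highly overlapping moments contribute more to each other.
    """
    n = len(moments)
    iou = np.array([[temporal_iou(moments[i], moments[j]) for j in range(n)]
                    for i in range(n)])
    w = np.exp(iou) / np.exp(iou).sum(axis=1, keepdims=True)  # row softmax
    return w @ feats  # IoU-weighted aggregation of moment features
```

In this sketch a disjoint moment (zero IoU with the rest) still receives uniform residual weight from the softmax; a real implementation might mask zero-overlap pairs or combine IoU with learned query/key scores.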
Keywords
Video moment retrieval, temporal localization, cross-modal interaction, comprehensive relations learning