Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context

CVPR(2020)

Abstract
Video visual relation detection (VidVRD) aims to describe all interacting objects in a video. Unlike relationships in static images, videos contain an additional temporal channel. Most existing works divide a video into short segments, predict relationships in each segment, and merge the results. Such methods cannot capture relations involving long motions, and predicting the same relationship repeatedly across neighboring video segments is inefficient. To address these issues, this work proposes a novel sliding-window scheme that predicts short-term and long-term relationships simultaneously. We run windows with different kernel sizes over object tracklets to generate sub-tracklet proposals of different durations, while keeping the computational load similar to that of segment-based methods. To fully utilize the spatial and temporal information in videos, we construct one spatial graph and one temporal graph and employ a Graph Convolutional Network to generate contextual embeddings for evaluating the compatibility of tracklet proposals. We predict relationships only on highly compatible proposal pairs. Our method achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR datasets across multiple tasks. On ImageNet-VidVRD in particular, we obtain an average improvement of 3% (R@50 from 8.07% to 11.21%) under all evaluation metrics.
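The sliding-window idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sliding_window_proposals`, the kernel sizes, and the `stride_ratio` parameter are hypothetical choices made here for illustration, and a tracklet is reduced to a frame span rather than a sequence of bounding boxes.

```python
from typing import List, Tuple


def sliding_window_proposals(
    start: int,
    end: int,
    kernel_sizes: List[int],
    stride_ratio: float = 0.5,  # assumed overlap setting, not from the paper
) -> List[Tuple[int, int]]:
    """Generate sub-tracklet frame spans (start inclusive, end exclusive).

    One pass is made per kernel size, so short and long windows are
    produced from the same tracklet instead of merging fixed segments.
    """
    proposals = []
    length = end - start
    for k in kernel_sizes:
        if k > length:
            # The tracklet is shorter than this window; skip the kernel.
            continue
        stride = max(1, int(k * stride_ratio))
        for s in range(start, end - k + 1, stride):
            proposals.append((s, s + k))
    return proposals


# A tracklet spanning frames 0..59 with hypothetical kernel sizes 30 and 60.
spans = sliding_window_proposals(0, 60, [30, 60])
print(spans)  # [(0, 30), (15, 45), (30, 60), (0, 60)]
```

In this sketch the short windows cover short-term interactions and the single 60-frame window covers the whole tracklet, which is the kind of long-duration proposal that per-segment prediction would have to reassemble by merging.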
Keywords
graph convolutional network,ImageNet-VidVRD,highly-compatible proposal pairs,tracklet proposal compatibility evaluation,temporal graph,video temporal information,spatial information,segment-based methods,sub-tracklet proposals,object tracklets,long-term relationships,sliding-window scheme,video segments,long motions,short segments,temporal channel,static images,interacting objects,video visual relation detection,spatio-temporal global context,video relation detection,short-term snippet