End-to-End Video Scene Graph Generation With Temporal Propagation Transformer

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Abstract
Video scene graph generation is an emerging research topic that aims to interpret a video as a temporally evolving graph structure by representing video objects as nodes and their relations as edges. Existing approaches predominantly follow a multi-step scheme comprising frame-level object detection, relation recognition, and temporal association. Although effective, these approaches neglect the mutual interactions between the independent steps, resulting in sub-optimal solutions. We present a novel end-to-end framework for video scene graph generation that naturally unifies object detection, object tracking, and relation recognition via a new Transformer structure, namely the Temporal Propagation Transformer (TPT). In particular, TPT extends an existing Transformer-based object detector (e.g., DETR) along the temporal dimension with a query propagation module, which additionally associates detected instances by identity across frames. A temporal dynamics encoder is then leveraged to dynamically enrich the features of the detected instances for relation recognition by attending to their historic states in previous frames. Meanwhile, a relation propagation strategy is devised to enforce the temporal consistency of relation recognition results across adjacent frames. Extensive experiments conducted on the VidHOI and Action Genome benchmarks demonstrate the superior performance of the proposed TPT over state-of-the-art methods.
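The query propagation idea described in the abstract can be illustrated with a short sketch: the output embeddings of instances detected in one frame are carried over as extra decoder queries for the next frame, so detection and identity association happen in a single forward pass. The Python/PyTorch code below is only a minimal sketch under that assumption; the class, function names (QueryPropagationDecoder, run_clip), and all hyper-parameters are illustrative and are not taken from the authors' implementation.

import torch
import torch.nn as nn

class QueryPropagationDecoder(nn.Module):
    """DETR-style decoder whose query set can be extended with queries
    propagated from the previous frame (illustrative sketch only)."""
    def __init__(self, d_model=256, num_queries=100, num_layers=6, nhead=8):
        super().__init__()
        # Learned per-frame detection queries, as in DETR.
        self.static_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, frame_memory, propagated_queries=None):
        # frame_memory: (B, HW, d_model) encoder features of the current frame.
        B = frame_memory.size(0)
        queries = self.static_queries.weight.unsqueeze(0).expand(B, -1, -1)
        if propagated_queries is not None:
            # Queries carried over from the previous frame keep their slot,
            # so outputs decoded from them inherit the same instance identity.
            queries = torch.cat([propagated_queries, queries], dim=1)
        return self.decoder(queries, frame_memory)  # (B, Nprop+Nq, d_model)

def run_clip(decoder, clip_memories, keep_topk=20):
    """Run the decoder over a clip, carrying a subset of output embeddings
    forward as next-frame queries (toy selection by slot order)."""
    propagated, outputs = None, []
    for mem in clip_memories:            # list of (B, HW, d_model) tensors
        hs = decoder(mem, propagated)
        outputs.append(hs)
        propagated = hs[:, :keep_topk]   # embeddings propagated to the next frame
    return outputs

In the actual TPT framework these propagated instance states also feed the temporal dynamics encoder and the relation propagation step; the sketch above only covers the cross-frame query reuse.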
Keywords
Video scene graph generation, transformer, temporal propagation