W-ART: Action Relation Transformer for Weakly-Supervised Temporal Action Localization

Mengzhu Li, Hongjun Wu, Yongcheng Liu, Hongzhe Liu, Cheng Xu, Xuewei Li

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Abstract

Weakly-supervised temporal action localization (WTAL) is a long-standing and challenging research problem in video signal analysis. The goal is to localize action segments in a video given only video-level labels. The key to this task is understanding how the diverse actions interact. In this paper, we propose W-ART, a relation Transformer that explicitly captures the relationships between action segments. We devise a new, effective Transformer architecture and construct new training loss functions for WTAL. Further, we propose a dedicated query mechanism to accommodate the different feature preferences of classification and localization. Thanks to these designs, W-ART can accurately localize diverse actions even in the weakly-supervised setting. Extensive evaluation and empirical analysis show that our method outperforms the state of the art on two challenging benchmarks, Charades and THUMOS14.

Keywords

Weakly-supervised Temporal Action Localization, Long-range Temporal Segment Dependency, Relationship Transformer, Weakly-supervised Query Mechanism
