TranSTL: Spatial-Temporal Localization Transformer for Multi-Label Video Classification

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022

Abstract
Multi-label video classification (MLVC) is a long-standing and challenging research problem in video signal analysis. Real-world videos generally contain many complex action labels, and these actions have inherent dependencies in both the spatial and temporal domains. Motivated by this observation, we propose TranSTL, a spatial-temporal localization Transformer framework for the MLVC task. In addition to leveraging global action label co-occurrence, we propose a novel plug-and-play Spatial Temporal Label Dependency (STLD) layer in TranSTL. STLD not only dynamically models the label co-occurrence in a video with a self-attention mechanism, but also captures spatial-temporal label dependencies with a cross-attention strategy. As a result, TranSTL is able to explicitly and accurately capture the diverse action labels in both the spatial and temporal domains. Extensive evaluation and empirical analysis show that TranSTL achieves superior performance over the state of the art on two challenging benchmarks, Charades and Multi-Thumos.
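
The two attention steps attributed to the STLD layer can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, simple residual sums, and a hypothetical linear classifier vector. The names `stld_sketch`, `label_emb`, `video_tokens`, and `classifier`, as well as the toy sizes, are invented for illustration. Label embeddings first attend to each other (label co-occurrence), then query a flattened set of spatial-temporal video tokens (cross-attention), and each label finally receives an independent sigmoid score.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


def add(a, b):
    """Element-wise residual sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stld_sketch(label_emb, video_tokens, classifier):
    """Hypothetical STLD-style layer: label self-attention, then cross-attention."""
    # Step 1: labels attend to each other (label co-occurrence).
    co, _ = attention(label_emb, label_emb, label_emb)
    labels = add(label_emb, co)
    # Step 2: each label queries the spatial-temporal video tokens.
    ctx, loc = attention(labels, video_tokens, video_tokens)
    labels = add(labels, ctx)
    # Step 3: one independent sigmoid score per label (multi-label output).
    scores = [
        1.0 / (1.0 + math.exp(-sum(x * w for x, w in zip(row, classifier))))
        for row in labels
    ]
    return scores, loc


random.seed(0)
# Toy sizes (assumed): 4 labels, 6 tokens (e.g. 3 frames x 2 regions), dim 8.
num_labels, num_tokens, dim = 4, 6, 8
label_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_labels)]
video_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
classifier = [random.gauss(0, 1) for _ in range(dim)]

scores, loc = stld_sketch(label_emb, video_tokens, classifier)
```

In this sketch, each row of `loc` is a distribution over video tokens for one label, which is one plausible reading of how cross-attention could localize a label in space and time. A trained model would use learned query, key, and value projections and multiple heads rather than the identity projections assumed here.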
Keywords
Multi-label Video Classification, Label Co-occurrence Dependency, Spatial Temporal Label Dependency, Transformer