SVT: Supertoken Video Transformer for Efficient Video Understanding

Chenbin Pan,Rui Hou,Hanchao Yu,Qifan Wang,Senem Velipasalar,Madian Khabsa

CoRR（2023）

引用 0|浏览46

暂无评分

摘要

Whether by processing videos with fixed resolution from start to end or incorporating pooling and down-scaling strategies, existing video transformers process the whole video content throughout the network without specially handling the large portions of redundant information. In this paper, we present a Supertoken Video Transformer (SVT) that incorporates a Semantic Pooling Module (SPM) to aggregate latent representations along the depth of visual transformer based on their semantics, and thus, reduces redundancy inherent in video inputs.~Qualitative results show that our method can effectively reduce redundancy by merging latent representations with similar semantics and thus increase the proportion of salient information for downstream tasks.~Quantitatively, our method improves the performance of both ViT and MViT while requiring significantly less computations on the Kinectics and Something-Something-V2 benchmarks.~More specifically, with our SPM, we improve the accuracy of MAE-pretrained ViT-B and ViT-L by 1.5% with 33% less GFLOPs and by 0.2% with 55% less FLOPs, respectively, on the Kinectics-400 benchmark, and improve the accuracy of MViTv2-B by 0.2% and 0.3% with 22% less GFLOPs on Kinectics-400 and Something-Something-V2, respectively.

查看译文

关键词

supertoken video transformer

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要