Spatio-Temporal Gated Transformers for Efficient Video Processing

Yawei Li, Babak Ehteshami Bejnordi, Bert Moons, Tijmen Blankevoort, Amirhossein Habibian, Radu Timofte, Luc Van Gool

Semantic Scholar (2021)


Abstract

We focus on the problem of efficient video stream processing with fully transformer-based architectures. Recent advances brought by transformers for image-based tasks have inspired research interest in applying transformers to videos. Yet, when image-based transformer solutions are applied to videos, the computation becomes inefficient because adjacent video frames carry redundant information. An analysis of the computation cost of the video object detection framework DETR identifies the linear layers as the major computational bottleneck. We therefore propose dynamic gating layers that perform conditional computation. With the generated binary or ternary gates, the computation for stable background tokens in the video frames can be skipped. Experimental results validate the effectiveness of the dynamic gating mechanism for transformers. For video object detection, the FLOPs could be reduced by 48.3% without a significant drop in accuracy.
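To make the idea of skipping computation for stable tokens concrete, here is a minimal sketch in plain Python. It is not the authors' method: the abstract describes gates that are generated by dynamic gating layers, whereas this sketch substitutes a hand-set threshold on how much a token changed since the previous frame. The names `gated_linear`, `flops_saved`, and `threshold`, and the toy data, are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: a fixed threshold stands in for the generated
# binary gate described in the abstract. All names here are hypothetical.

def linear(x, weight, bias):
    """Dense layer applied to one token: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gated_linear(tokens, prev_tokens, prev_outputs, weight, bias, threshold=1e-3):
    """Recompute the linear layer only for tokens that changed between frames.

    Returns (outputs, gates). gates[i] is 1 if token i was recomputed and 0 if
    its cached output from the previous frame was reused.
    """
    outputs, gates = [], []
    for tok, prev_tok, prev_out in zip(tokens, prev_tokens, prev_outputs):
        # Squared L2 distance between the current and previous token.
        change = sum((a - b) ** 2 for a, b in zip(tok, prev_tok))
        if change > threshold:
            outputs.append(linear(tok, weight, bias))  # gate open: compute
            gates.append(1)
        else:
            outputs.append(prev_out)  # gate closed: reuse cached output
            gates.append(0)
    return outputs, gates


def flops_saved(gates):
    """Fraction of per-token linear-layer evaluations that were skipped."""
    return 1.0 - sum(gates) / len(gates)


# Toy example: four 2-dimensional tokens; only the second changes in frame 1.
weight = [[1.0, 0.0], [0.0, 2.0]]
bias = [0.0, 1.0]
frame0 = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0], [4.0, 4.0]]
frame1 = [[1.0, 1.0], [2.5, 0.0], [0.0, 3.0], [4.0, 4.0]]

out0 = [linear(t, weight, bias) for t in frame0]  # first frame: full compute
out1, gates = gated_linear(frame1, frame0, out0, weight, bias)
```

In this toy setting only one of the four tokens is recomputed, so three quarters of the linear-layer evaluations are skipped while the output matches a full recomputation. The 48.3% figure in the abstract comes from the authors' experiments and is not something this sketch reproduces.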
