MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning.

European Conference on Computer Vision (2022)

Abstract
Recently, MLP-like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-like architecture in the video domain has not been explored, due to the complexity of spatial-temporal modeling and its large computational burden. To fill this gap, we present an efficient, self-attention-free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, i.e., \(\mathtt{MorphFC_s}\) and \(\mathtt{MorphFC_t}\), for spatial and temporal modeling respectively. \(\mathtt{MorphFC_s}\) effectively captures core semantics in each frame via progressive token interaction along both the height and width dimensions. \(\mathtt{MorphFC_t}\), in turn, adaptively learns long-term dependencies over frames via temporal token aggregation at each spatial location. With such multi-dimension and multi-scale factorization, the MorphMLP block achieves a strong accuracy-computation balance. Finally, we evaluate MorphMLP on a number of popular video benchmarks. Compared with recent state-of-the-art models, MorphMLP significantly reduces computation while improving accuracy: MorphMLP-S uses only 50% of the GFLOPs of VideoSwin-T yet achieves a 0.9% top-1 improvement on Kinetics400 under ImageNet1K pretraining, and MorphMLP-B uses only 43% of the GFLOPs of MViT-B yet achieves a 2.4% top-1 improvement on SSV2, even though MorphMLP-B is pretrained on ImageNet1K while MViT-B is pretrained on Kinetics400. Moreover, our method, adapted to the image domain, outperforms previous SOTA MLP-like architectures. Code is available at https://github.com/MTLab/MorphMLP.
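To illustrate the factorization described above, the following is a minimal sketch of the two mixing directions: a spatial FC that mixes tokens along the height and then the width dimension within each frame, and a temporal FC that mixes tokens across frames at each spatial location. This is a simplified illustration only, not the authors' implementation; the function names, the plain per-axis linear maps, and the omission of the paper's progressive/multi-scale chunking and channel grouping are all assumptions for brevity.

```python
import numpy as np

def morph_fc_spatial(x, w_h, w_w):
    """Simplified stand-in for MorphFC_s: mix tokens along height, then width.

    x   : (T, H, W, C) video features
    w_h : (H, H) learnable weights mixing tokens along the height axis
    w_w : (W, W) learnable weights mixing tokens along the width axis
    (The actual MorphFC_s operates on chunked token groups; this sketch
    uses plain full-axis FC layers for clarity.)
    """
    x = np.einsum('thwc,hg->tgwc', x, w_h)  # mix along height per frame
    x = np.einsum('thwc,wg->thgc', x, w_w)  # mix along width per frame
    return x

def morph_fc_temporal(x, w_t):
    """Simplified stand-in for MorphFC_t: aggregate tokens over frames.

    x   : (T, H, W, C) video features
    w_t : (T, T) learnable weights mixing tokens along the frame axis,
          applied independently at each spatial location (h, w).
    """
    return np.einsum('thwc,tg->ghwc', x, w_t)

# A MorphMLP block applies the two layers in sequence:
# spatial mixing within frames, then temporal mixing across frames.
def morph_block(x, w_h, w_w, w_t):
    return morph_fc_temporal(morph_fc_spatial(x, w_h, w_w), w_t)
```

Because each FC acts along a single axis, the cost grows linearly in that axis length rather than quadratically in the full token count, which is the source of the accuracy-computation balance claimed above.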
Keywords
MLP, Video classification, Representation learning