MV2MAE: Multi-View Video Masked Autoencoders
CoRR (2024)
Abstract
Videos captured from multiple viewpoints can help in perceiving the 3D
structure of the world and benefit computer vision tasks such as action
recognition, tracking, etc. In this paper, we present a method for
self-supervised learning from synchronized multi-view videos. We use a
cross-view reconstruction task to inject geometry information into the model.
Our approach is based on the masked autoencoder (MAE) framework. In addition to
the same-view decoder, we introduce a separate cross-view decoder that
leverages a cross-attention mechanism to reconstruct a target-viewpoint video
from a source-viewpoint video, encouraging representations that are robust to
viewpoint changes. For videos, static regions can be reconstructed trivially,
which hinders learning meaningful representations. To tackle this, we introduce
a motion-weighted reconstruction loss that improves temporal modeling. We
report state-of-the-art results on the NTU-60, NTU-120 and ETRI datasets, as
well as in the transfer-learning setting on the NUCLA, PKU-MMD-II and ROCOG-v2
datasets, demonstrating the robustness of our approach. Code will be made
available.
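The motion-weighted reconstruction loss can be sketched roughly as below: per-patch reconstruction error is reweighted by the temporal change of each patch, so static regions contribute less. The function name, the `(batch, frames, patches, patch_dim)` tensor layout, and the normalization scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def motion_weighted_recon_loss(pred, target, eps=1e-6):
    """Hypothetical sketch of a motion-weighted reconstruction loss.

    pred, target: arrays of shape (B, T, N, D) -- batch, frames,
    patches per frame, patch dimension. Patches with larger
    frame-to-frame change in the target receive larger loss weights.
    """
    # Per-patch squared error, averaged over the patch dimension.
    err = ((pred - target) ** 2).mean(axis=-1)            # (B, T, N)

    # Motion magnitude: temporal difference of target patches.
    diff = target[:, 1:] - target[:, :-1]                 # (B, T-1, N, D)
    motion = np.abs(diff).mean(axis=-1)                   # (B, T-1, N)
    # The first frame has no predecessor; reuse the first difference.
    motion = np.concatenate([motion[:, :1], motion], axis=1)  # (B, T, N)

    # Normalize weights so the loss scale stays comparable to plain MSE.
    weights = motion / (motion.mean(axis=(1, 2), keepdims=True) + eps)
    return (weights * err).mean()

# Toy usage with synthetic patch embeddings.
rng = np.random.default_rng(0)
target = rng.standard_normal((2, 4, 8, 16))
pred = target + 0.1 * rng.standard_normal((2, 4, 8, 16))
loss = motion_weighted_recon_loss(pred, target)
```

A plain MSE over all patches would let the model score well by copying static background; weighting by temporal change shifts the training signal toward moving regions, which is what the paper's loss is designed to achieve.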