Pair-wise Layer Attention with Spatial Masking for Video Prediction.
CoRR (2023)
Abstract
Video prediction generates future frames from historical frames and
has shown great potential in many applications, e.g., meteorological
prediction and autonomous driving. Previous works often decode the ultimate
high-level semantic features to future frames without texture details, which
deteriorates the prediction quality. Motivated by this, we develop a Pair-wise
Layer Attention (PLA) module to enhance the layer-wise semantic dependency of
the feature maps derived from the U-shape structure in Translator, by coupling
low-level visual cues and high-level features. Hence, the texture details of
predicted frames are enriched. Moreover, most existing methods capture the
spatiotemporal dynamics by Translator, but fail to sufficiently utilize the
spatial features of Encoder. This inspires us to design a Spatial Masking (SM)
module that masks part of the encoding features during pretraining, which
increases the visibility of the remaining feature pixels to Decoder. Combining
the two modules, we present a
Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video
prediction to capture the spatiotemporal dynamics, which reflect the motion
trend. Extensive experiments and rigorous ablation studies on five benchmarks
demonstrate the advantages of the proposed approach. The code is available at
GitHub.
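
The abstract does not say how the SM module chooses which encoding features to mask, what ratio it uses, or what value replaces a masked position. As a rough illustration only, the sketch below assumes uniformly random spatial positions are zeroed across all channels of one C x H x W feature map; the function name `spatial_mask`, the `mask_ratio` parameter, and the zero-fill choice are hypothetical and are not taken from the paper or its code.

```python
import random


def spatial_mask(features, mask_ratio, seed=None):
    """Zero out a random subset of spatial positions in a C x H x W feature map.

    Assumption (not from the paper): positions are sampled uniformly at random,
    the same positions are masked in every channel, and masked values become 0.
    Returns the masked copy and the H x W binary mask (1 = kept, 0 = masked).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")

    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])

    # Pick which flattened spatial indices to hide.
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * height * width))
    masked_positions = set(rng.sample(range(height * width), num_masked))

    # Build the binary mask once and share it across channels.
    mask = [
        [0 if (r * width + c) in masked_positions else 1 for c in range(width)]
        for r in range(height)
    ]
    masked = [
        [[features[ch][r][c] * mask[r][c] for c in range(width)]
         for r in range(height)]
        for ch in range(channels)
    ]
    return masked, mask


if __name__ == "__main__":
    # Toy 2-channel 4x4 feature map filled with ones.
    feats = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
    masked_feats, mask = spatial_mask(feats, mask_ratio=0.5, seed=0)
    for row in mask:
        print(row)
```

In a pretraining setup of this kind, the masked features would be passed on to the later stages, so that reconstruction has to rely on the positions that remain visible; how PLA-SM does this in detail is described in the paper itself, not here.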