SalFoM: Dynamic Saliency Prediction with Video Foundation Models
arXiv (2024)
Abstract
Recent advancements in video saliency prediction (VSP) have shown promising
performance compared to the human visual system, whose emulation is the primary
goal of VSP. However, current state-of-the-art models employ spatio-temporal
transformers trained on limited amounts of data, which hinders their
generalizability and adaptation to downstream tasks. Vision foundation models
present a potential solution to improve the VSP process. However, adapting
image foundation models to the video domain presents significant challenges in
modeling scene dynamics and capturing temporal information. To address these
challenges, and as the first initiative to design a VSP model based on video
foundation models, we introduce SalFoM, a novel encoder-decoder video
transformer architecture. Our model employs UnMasked Teacher (UMT) as its
feature extractor and presents a heterogeneous decoder that features a
locality-aware spatio-temporal transformer and integrates local and global
spatio-temporal information from various perspectives to produce the final saliency map. Our
qualitative and quantitative experiments on the challenging VSP benchmark
datasets of DHF1K, Hollywood-2 and UCF-Sports demonstrate the superiority of
our proposed model in comparison with the state-of-the-art methods.
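To make the described pipeline concrete, below is a minimal PyTorch sketch of the kind of encoder-decoder architecture the abstract outlines: a video feature extractor (a small 3D-conv stub stands in for the actual UMT foundation model) feeding a decoder that fuses a locality-aware convolutional branch with a global spatio-temporal attention branch before predicting a saliency map. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a SalFoM-style encoder-decoder VSP model.
# The real model uses UnMasked Teacher (UMT) as its encoder; a conv stub
# stands in here so the sketch runs standalone.
import torch
import torch.nn as nn


class VideoEncoderStub(nn.Module):
    """Placeholder for a video foundation model such as UMT."""
    def __init__(self, in_ch=3, dim=128):
        super().__init__()
        # Tubelet embedding: patchify the clip into spatio-temporal tokens.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=(2, 16, 16),
                              stride=(2, 16, 16))

    def forward(self, x):            # x: (B, 3, T, H, W)
        return self.proj(x)          # (B, C, T', H', W')


class LocalSTBranch(nn.Module):
    """Locality-aware branch: 3D convs preserve local spatio-temporal cues."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv3d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, f):
        return self.block(f)


class GlobalSTBranch(nn.Module):
    """Global branch: self-attention over all spatio-temporal tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            batch_first=True)

    def forward(self, f):            # f: (B, C, T, H, W)
        b, c, t, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)        # (B, T*H*W, C)
        tokens = self.attn(tokens)
        return tokens.transpose(1, 2).view(b, c, t, h, w)


class HeterogeneousDecoder(nn.Module):
    """Fuses local and global spatio-temporal features into a saliency map."""
    def __init__(self, dim):
        super().__init__()
        self.local = LocalSTBranch(dim)
        self.global_ = GlobalSTBranch(dim)
        self.fuse = nn.Conv3d(2 * dim, dim, kernel_size=1)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, f, out_hw):
        f = self.fuse(torch.cat([self.local(f), self.global_(f)], dim=1))
        f = f.mean(dim=2)                            # pool the time axis
        f = nn.functional.interpolate(f, size=out_hw, mode="bilinear",
                                      align_corners=False)
        return torch.sigmoid(self.head(f))           # (B, 1, H, W)


class SalFoMSketch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = VideoEncoderStub(dim=dim)
        self.decoder = HeterogeneousDecoder(dim)

    def forward(self, clip):                         # clip: (B, 3, T, H, W)
        return self.decoder(self.encoder(clip), clip.shape[-2:])


if __name__ == "__main__":
    model = SalFoMSketch()
    clip = torch.randn(1, 3, 8, 224, 224)            # 8-frame RGB clip
    print(model(clip).shape)                         # torch.Size([1, 1, 224, 224])
```

The two-branch decoder mirrors the abstract's stated design goal of combining local and global spatio-temporal information; in practice the UMT encoder would replace the stub and supply much richer pretrained features.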