MSVT: Multiple Spatiotemporal Views Transformer for DeepFake Video Detection

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
DeepFake video generation has advanced rapidly in recent years, raising new security concerns for society. Because existing video-based detection methods rely on a coarse spatiotemporal view, they struggle to capture fine-grained spatiotemporal information, which limits their generalization ability. In addition, although transformers have achieved great success over the past few years, their application to DeepFake video detection remains underexplored. To address these problems, we propose a novel Multiple Spatiotemporal Views Transformer (MSVT), which combines a Local Spatiotemporal View (LSV) and a Global Spatiotemporal View (GSV) to mine more detailed spatiotemporal information. First, to establish the LSV, unlike existing works that sparsely sample single frames to build the input sequence, we adopt a local-consecutive temporal view, in which frames are sampled in groups of consecutive frames, to capture vital dynamic inconsistencies. The frame features extracted within each group are then fed to a temporal transformer followed by a feature fusion module to generate group-level spatiotemporal features. Next, we establish the GSV by feeding the frame features of the whole video to a temporal transformer followed by a feature fusion module. Finally, we propose a novel global-local transformer (GLT) that integrates these multi-level features to mine more subtle and comprehensive features. Extensive experiments on six large datasets demonstrate that MSVT outperforms state-of-the-art detection methods.
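To make the contrast between sparse single-frame sampling and the local-consecutive temporal view concrete, the following is a minimal sketch of the two index-selection schemes. It is an illustration only, not the authors' implementation: the function names, the choice to centre each sample or group within an equal-length segment, and the example sizes (32 frames, 4 groups of 4 frames) are all assumptions, since the abstract does not specify them.

```python
from typing import List


def sparse_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Sparse sampling: one isolated frame from each of num_samples equal segments.

    Hypothetical helper; the exact sampling rule of prior work is assumed here.
    """
    if num_samples <= 0 or num_frames < num_samples:
        raise ValueError("need 0 < num_samples <= num_frames")
    segment = num_frames / num_samples
    # Take the frame at the centre of each segment.
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def local_consecutive_groups(
    num_frames: int, num_groups: int, group_len: int
) -> List[List[int]]:
    """Local-consecutive sampling: num_groups groups of group_len adjacent frames.

    Hypothetical helper; group count, length, and placement are assumptions.
    """
    if num_groups <= 0 or group_len <= 0 or num_frames < num_groups * group_len:
        raise ValueError("video too short for the requested groups")
    segment = num_frames / num_groups
    groups = []
    for g in range(num_groups):
        # Centre the group of adjacent frames inside its segment.
        start = int(segment * g + (segment - group_len) / 2)
        groups.append(list(range(start, start + group_len)))
    return groups


if __name__ == "__main__":
    print(sparse_frame_indices(32, 4))
    # [4, 12, 20, 28]
    print(local_consecutive_groups(32, 4, 4))
    # [[2, 3, 4, 5], [10, 11, 12, 13], [18, 19, 20, 21], [26, 27, 28, 29]]
```

In the sparse scheme, neighbouring samples are several frames apart, so frame-to-frame motion is not visible to the model. In the grouped scheme, each group contains adjacent frames, which is what allows a per-group temporal transformer to observe short-range dynamic inconsistencies. Under the same assumptions, the features of all sampled frames taken together would form the input to the global view.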
Keywords
Generalized DeepFake detection, multiple spatiotemporal views, global-local transformer