Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
arXiv (2024)
Abstract
Weakly supervised video anomaly detection (WSVAD) is a challenging task.
Generating fine-grained pseudo-labels from weak labels and then
self-training a classifier is currently a promising solution. However,
existing methods use only the RGB visual modality and neglect category text
information, which limits the accuracy of the generated pseudo-labels and
degrades self-training performance. Inspired
by the manual labeling process based on the event description, in this paper,
we propose a novel pseudo-label generation and self-training framework based on
Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer
the rich language-visual knowledge of the contrastive language-image
pre-training (CLIP) model for aligning the video event description text and
corresponding video frames to generate pseudo-labels. Specifically, we first
fine-tune CLIP for domain adaptation by designing two ranking losses and a
distributional inconsistency loss. Next, we propose a learnable text prompt
mechanism, assisted by a normality visual prompt, to further improve the
matching accuracy between video event description text and video frames. Then, we
design a pseudo-label generation module based on the normality guidance to
infer reliable frame-level pseudo-labels. Finally, we introduce a temporal
context self-adaptive learning module to learn the temporal dependencies of
different video events more flexibly and accurately. Extensive experiments show
that our method achieves state-of-the-art performance on two benchmark
datasets, UCF-Crime and XD-Viole
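
The core idea of matching event description text against video frames to obtain frame-level pseudo-labels can be illustrated with a minimal sketch. It assumes frame and text embeddings have already been extracted by CLIP-style encoders, and it uses a simple softmax over a "normal" prompt and an "anomalous" prompt as a stand-in for normality guidance. The function names, the temperature, and the threshold are hypothetical choices for illustration. This is not the TPWNG pseudo-label generation module, whose details the abstract does not specify.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def frame_pseudo_labels(frame_emb, anomaly_text_emb, normal_text_emb,
                        temperature=0.07, threshold=0.5):
    """Assign a hypothetical pseudo-label to each frame.

    frame_emb:        (T, D) frame embeddings (assumed precomputed).
    anomaly_text_emb: (D,) embedding of the anomalous event description.
    normal_text_emb:  (D,) embedding of a "normal" description.
    Returns (anomaly_prob, labels), both of shape (T,).
    """
    frames = l2_normalize(np.asarray(frame_emb, dtype=np.float64))
    texts = l2_normalize(np.stack([
        np.asarray(normal_text_emb, dtype=np.float64),
        np.asarray(anomaly_text_emb, dtype=np.float64),
    ]))
    # Cosine similarity of every frame to both prompts, sharpened by temperature.
    logits = frames @ texts.T / temperature           # shape (T, 2)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    anomaly_prob = probs[:, 1]
    # Assumption: a fixed threshold turns soft scores into binary pseudo-labels.
    labels = (anomaly_prob > threshold).astype(np.int64)
    return anomaly_prob, labels

# Toy usage with 2-D embeddings (illustrative values only).
toy_frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
toy_probs, toy_labels = frame_pseudo_labels(
    toy_frames,
    anomaly_text_emb=np.array([0.0, 1.0]),
    normal_text_emb=np.array([1.0, 0.0]),
)
```

Comparing each frame against the normal prompt as well as the anomalous one means a frame is labeled anomalous only when it matches the anomaly description better than it matches normality, rather than on raw similarity alone.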