Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition
CoRR (2024)
Abstract
With the explosive growth of video data in real-world applications, a
comprehensive representation of videos becomes increasingly important. In this
paper, we address the problem of video scene recognition, whose goal is to
learn a high-level video representation to classify scenes in videos. Due to
the diversity and complexity of video contents in realistic scenarios, this
task remains a challenge. Most existing works identify scenes in videos using
only visual or textual information from a temporal perspective, ignoring the
valuable information hidden in single frames, while several earlier studies
recognize scenes only in separate images from a non-temporal perspective. We
argue that these two perspectives are both meaningful for this task and
complementary to each other, and that externally introduced knowledge can
also improve the comprehension of videos. We propose a novel two-stream
framework that models video representations from both temporal and
non-temporal perspectives and integrates the two in an end-to-end manner
through self-distillation. In addition, we design a knowledge-enhanced
feature fusion and label prediction method that introduces knowledge
naturally into the task of video scene recognition.
Experiments conducted on a real-world dataset demonstrate the effectiveness of
our proposed method.
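
The abstract does not specify how the two streams are coupled, so the following is only a minimal sketch of one common form of self-distillation between two prediction heads, not the authors' formulation. It assumes that each stream outputs class logits, that the temporal stream acts as the teacher, and that the hypothetical parameters `alpha` and `temperature` weight and soften the distillation term; all function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def two_stream_loss(temporal_logits, frame_logits, label,
                    alpha=0.5, temperature=2.0):
    """Supervised loss on both streams plus a distillation term.

    Assumption: the temporal stream is the teacher and the
    non-temporal (single-frame) stream is the student.
    """
    supervised = (cross_entropy(softmax(temporal_logits), label)
                  + cross_entropy(softmax(frame_logits), label))
    teacher = softmax(temporal_logits, temperature)
    student = softmax(frame_logits, temperature)
    return supervised + alpha * kl_divergence(teacher, student)


if __name__ == "__main__":
    # Toy logits for a three-class scene problem; the true class is index 0.
    print(two_stream_loss([2.0, 0.5, -1.0], [1.0, 1.2, -0.5], label=0))
```

When both streams produce identical logits the distillation term vanishes and only the two supervised terms remain, so the extra term penalizes only disagreement between the perspectives.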