Step differences in instructional video
CVPR 2024
Abstract
Comparing a user video to a reference how-to video is a key requirement for
AR/VR technology delivering personalized assistance tailored to the user's
progress. However, current approaches for language-based assistance can only
answer questions about a single video. We propose an approach that first
automatically generates large amounts of visual instruction tuning data
involving pairs of videos from HowTo100M by leveraging existing step
annotations and accompanying narrations, and then trains a video-conditioned
language model to jointly reason across multiple raw videos. Our model achieves
state-of-the-art performance at identifying differences between video pairs and
ranking videos based on the severity of these differences, and shows promising
ability to perform general reasoning over multiple videos.