Look, Remember and Reason: Grounded reasoning in videos with language models
arXiv (Cornell University)(2023)
摘要
Multi-modal language models (LM) have recently shown promising performance in
high-level reasoning tasks on videos. However, existing methods still fall
short in tasks like causal or compositional spatiotemporal reasoning over
actions, in which model predictions need to be grounded in fine-grained
low-level details, such as object motions and object interactions. In this
work, we propose training an LM end-to-end on low-level surrogate tasks,
including object detection, re-identification, and tracking, to endow the model
with the required low-level visual capabilities. We show that a two-stream
video encoder with spatiotemporal attention is effective at capturing the
required static and motion-based cues in the video. By leveraging the LM's
ability to perform the low-level surrogate tasks, we can cast reasoning in
videos as the three-step process of Look, Remember, Reason wherein visual
information is extracted using low-level visual skills step-by-step and then
integrated to arrive at a final answer. We demonstrate the effectiveness of our
framework on diverse visual reasoning tasks from the ACRE, CATER,
Something-Else and STAR datasets. Our approach is trainable end-to-end and
surpasses state-of-the-art task-specific methods across these tasks by a large
margin.
更多查看译文
关键词
visual reasoning
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要