LITA: Language Instructed Temporal-Localization Assistant
arxiv(2024)
摘要
There has been tremendous progress in multimodal Large Language Models
(LLMs). Recent works have extended these models to video input with promising
instruction following capabilities. However, an important missing piece is
temporal localization. These models cannot accurately answer the "When?"
questions. We identify three key aspects that limit their temporal localization
capabilities: (i) time representation, (ii) architecture, and (iii) data. We
address these shortcomings by proposing Language Instructed
Temporal-Localization Assistant (LITA) with the following features: (1) We
introduce time tokens that encode timestamps relative to the video length to
better represent time in videos. (2) We introduce SlowFast tokens in the
architecture to capture temporal information at fine temporal resolution. (3)
We emphasize temporal localization data for LITA. In addition to leveraging
existing video datasets with timestamps, we propose a new task, Reasoning
Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for
learning and evaluating this task. Reasoning temporal localization requires
both the reasoning and temporal localization of Video LLMs. LITA demonstrates
strong performance on this challenging task, nearly doubling the temporal mean
intersection-over-union (mIoU) of baselines. In addition, we show that our
emphasis on temporal localization also substantially improves video-based text
generation compared to existing Video LLMs, including a 36
improvement of Temporal Understanding. Code is available at:
https://github.com/NVlabs/LITA
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要