Towards Language-Driven Video Inpainting via Multimodal Large Language Models
CoRR(2024)
摘要
We introduce a new task – language-driven video inpainting, which uses
natural language instructions to guide the inpainting process. This approach
overcomes the limitations of traditional video inpainting methods that depend
on manually labeled binary masks, a process often tedious and labor-intensive.
We present the Remove Objects from Videos by Instructions (ROVI) dataset,
containing 5,650 videos and 9,091 inpainting results, to support training and
evaluation for this task. We also propose a novel diffusion-based
language-driven video inpainting framework, the first end-to-end baseline for
this task, integrating Multimodal Large Language Models to understand and
execute complex language-based inpainting requests effectively. Our
comprehensive results showcase the dataset's versatility and the model's
effectiveness in various language-instructed inpainting scenarios. We will make
datasets, code, and models publicly available.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要