Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
CoRR (2024)
Abstract
The interaction between humans and artificial intelligence (AI) is a crucial
factor that reflects the effectiveness of multimodal large language models
(MLLMs). However, current MLLMs primarily focus on image-level comprehension
and limit interaction to textual instructions, thereby constraining their
flexibility in usage and depth of response. In this paper, we introduce the
Draw-and-Understand project: a new model, a multi-domain dataset, and a
challenging benchmark for visual prompting. Specifically, we propose SPHINX-V,
a new end-to-end trained MLLM that connects a vision encoder, a visual prompt
encoder, and an LLM to support various visual prompts (points, bounding boxes,
and free-form shapes) alongside language understanding. To
advance visual prompting research for MLLMs, we introduce MDVP-Data and
MDVP-Bench. MDVP-Data features a multi-domain dataset containing 1.6M unique
image-visual prompt-text instruction-following samples, including natural
images, document images, OCR images, mobile screenshots, web screenshots, and
multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and
challenging benchmark to assess a model's capability in understanding visual
prompting instructions. Our experiments demonstrate SPHINX-V's impressive
multimodal interaction capabilities through visual prompting, revealing
significant improvements in detailed pixel-level description and
question-answering abilities.
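To make the three-part layout concrete, here is a minimal sketch of how a vision encoder, a visual prompt encoder, and an LLM could be wired together. Everything in it is a hypothetical stand-in: the names `ToySphinxV` and `VisualPromptEncoder`, the patch-embedding "vision encoder", and the small Transformer playing the role of the LLM are illustrative assumptions, not the paper's actual SPHINX-V architecture or code.

```python
import torch
import torch.nn as nn


class VisualPromptEncoder(nn.Module):
    """Toy encoder: projects point and box prompts into the token space.
    (Hypothetical; the paper's encoder also handles free-form shapes.)"""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.point_proj = nn.Linear(2, embed_dim)  # normalized (x, y)
        self.box_proj = nn.Linear(4, embed_dim)    # normalized (x1, y1, x2, y2)

    def forward(self, points=None, boxes=None):
        # At least one prompt type is expected per call.
        tokens = []
        if points is not None:          # (B, Np, 2) -> (B, Np, D)
            tokens.append(self.point_proj(points))
        if boxes is not None:           # (B, Nb, 4) -> (B, Nb, D)
            tokens.append(self.box_proj(boxes))
        return torch.cat(tokens, dim=1)


class ToySphinxV(nn.Module):
    """Minimal sketch of the abstract's layout: a vision encoder and a
    visual prompt encoder feed a shared token sequence into an LLM."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Stand-in vision encoder: patchify + linear projection (a real
        # system would use a pretrained ViT backbone).
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.prompt_encoder = VisualPromptEncoder(embed_dim)
        # Stand-in "LLM": a small Transformer over the fused sequence.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )
        self.llm = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, points=None, boxes=None, text_embeds=None):
        # Image patches: (B, 3, H, W) -> (B, P, D)
        img_tokens = self.patchify(image).flatten(2).transpose(1, 2)
        prompt_tokens = self.prompt_encoder(points=points, boxes=boxes)
        seq = [img_tokens, prompt_tokens]
        if text_embeds is not None:     # (B, T, D) instruction embeddings
            seq.append(text_embeds)
        return self.llm(torch.cat(seq, dim=1))


if __name__ == "__main__":
    model = ToySphinxV()
    image = torch.randn(1, 3, 224, 224)
    points = torch.rand(1, 2, 2)        # two clicked points
    boxes = torch.rand(1, 1, 4)         # one bounding box
    out = model(image, points=points, boxes=boxes)
    print(out.shape)                    # torch.Size([1, 199, 512])
```

The key design idea the sketch captures is that visual prompts become extra tokens in the same sequence as image patches and text, so the language model can attend jointly to what the user pointed at and what they asked.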