Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
CoRR (2023)
Abstract
Large Vision and Language Models have enabled significant advances in fully
supervised and zero-shot vision tasks. These large pre-trained architectures
serve as the basis for what are currently known as Instruction Tuning Large
Vision and Language Models (IT-LVLMs). IT-LVLMs are general-purpose multi-modal
assistants whose responses are modulated by natural language instructions and
arbitrary visual data. Despite this versatility, IT-LVLM effectiveness in
fundamental computer vision problems remains unclear, primarily due to the
absence of a standardized evaluation benchmark. This paper introduces a
Multi-modal Evaluation Benchmark named MERLIM, a scalable test-bed to assess
the performance of IT-LVLMs on fundamental computer vision tasks. MERLIM
contains over 279K image-question pairs, and has a strong focus on detecting
cross-modal "hallucination" events in IT-LVLMs, where the language output
refers to visual concepts that lack any effective grounding in the image. Our
results show that state-of-the-art IT-LVLMs remain limited in identifying
fine-grained visual concepts, that object hallucinations are common across
tasks, and that their outputs are strongly biased by small variations in the
input query, even when the queries are semantically identical. Our findings
also suggest that these models have weak visual grounding, yet can still make
adequate guesses from global visual patterns or textual biases encoded in the
LLM component.
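
To make the notion of a cross-modal "hallucination" concrete, the sketch below shows one simple way such an event could be flagged: checking whether object categories named in an IT-LVLM's free-form answer actually appear among the image's ground-truth labels. This is an illustrative assumption, not MERLIM's actual evaluation procedure; the vocabulary, answer, and label set are hypothetical.

```python
# Illustrative sketch (not MERLIM's actual pipeline): flag object
# categories mentioned in a model's answer that have no ground-truth
# counterpart in the image.

# Hypothetical closed vocabulary of object categories (e.g., COCO-style classes).
VOCABULARY = {"person", "dog", "cat", "bicycle", "car", "frisbee"}

def mentioned_categories(answer: str, vocabulary: set[str]) -> set[str]:
    """Return the vocabulary categories named in a model's free-form answer."""
    tokens = {token.strip(".,!?").lower() for token in answer.split()}
    return vocabulary & tokens

def hallucinated_objects(answer: str, ground_truth: set[str]) -> set[str]:
    """Categories the model mentions that lack any grounding in the image."""
    return mentioned_categories(answer, VOCABULARY) - ground_truth

# Example: the image contains only a dog and a frisbee, but the model
# also claims to see a person.
answer = "A dog is catching a frisbee while a person watches."
ground_truth = {"dog", "frisbee"}
print(hallucinated_objects(answer, ground_truth))  # {'person'}
```

In practice, matching free-form language to a fixed label set requires more than exact word overlap (synonyms, plurals, paraphrases), which is part of why a standardized benchmark is needed.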