What Makes Multimodal In-Context Learning Work?
arXiv (2024)
Abstract
Large Language Models have demonstrated remarkable performance across various
tasks, exhibiting the capacity to swiftly acquire new skills, such as through
In-Context Learning (ICL) with minimal demonstration examples. In this work, we
present a comprehensive framework for investigating Multimodal ICL (M-ICL) in
the context of Large Multimodal Models. We consider the best open-source
multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal
tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily
relies on text-driven mechanisms, showing little to no influence from the image
modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is not
better than a simple strategy based on majority voting over the context examples.
Moreover, we identify several biases and limitations of M-ICL that warrant
consideration prior to deployment. Code available at
https://gitlab.com/folbaeni/multimodal-icl
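To make the comparison in finding (2) concrete, here is a minimal sketch, not the paper's implementation: a RICES-style retrieval step (ranking support examples by embedding similarity to the query, as in Retrieval-based In-Context Example Selection) and the majority-voting baseline that simply predicts the most frequent label among the retrieved examples. Function names, the embedding dimensionality, and the cosine-similarity choice are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def rices_select(query_emb, support_embs, k=8):
    """RICES-style selection (sketch): rank support examples by cosine
    similarity to the query embedding and return the top-k indices."""
    sims = support_embs @ query_emb / (
        np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    return np.argsort(sims)[::-1][:k]

def majority_vote(labels):
    """Baseline: predict the most frequent label among the selected
    in-context examples, without querying the model at all."""
    return Counter(labels).most_common(1)[0][0]

# Toy usage with random embeddings standing in for image features.
rng = np.random.default_rng(0)
support_embs = rng.normal(size=(100, 32))   # hypothetical support set
support_labels = rng.choice(["cat", "dog"], size=100).tolist()
query_emb = rng.normal(size=32)

idx = rices_select(query_emb, support_embs, k=8)
prediction = majority_vote([support_labels[i] for i in idx])
```

The paper's point is that when demonstrations are retrieved this way, the label statistics of the retrieved set alone (the `majority_vote` baseline) can match the full M-ICL pipeline.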