MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples
CoRR(2023)
摘要
Although In-Context Learning (ICL) brings remarkable performance gains to
Large Language Models (LLMs), the improvements remain lower than fine-tuning on
downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT),
a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning by
fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We
propose the Multi-Modal Hub (M-Hub), a unified module that captures various
multi-modal features according to different inputs and objectives. Based on
M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual
features and subsequently generate outputs conditioned on the textual-guided
visual features. Moreover, leveraging the flexibility of M-Hub, we design a
variety of in-context demonstrations. Extensive experiments on a diverse range
of downstream multi-modal tasks demonstrate that MMICT significantly
outperforms traditional fine-tuning strategy and the vanilla ICT method that
directly takes the concatenation of all information from different modalities
as input.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要