Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
arXiv (2024)
Abstract
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders into the input space of pre-trained language models as visual prompts, and then transferring the models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm is still inefficient, since it significantly increases the input length of the language models. In this paper, rather than integrating visual prompts into the input, we regard visual prompts as additional knowledge that helps language models address tasks involving visual information. Motivated by the finding that the Feed-Forward Network (FFN) of a language model acts as a "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of the FFN to inject visual knowledge. Experimental results across various VL tasks and language models show that MemVP significantly reduces the training time and inference latency of the fine-tuned VL models while surpassing the performance of previous PEFT methods. Code: https://github.com/JieShibo/MemVP
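To make the memory-space injection concrete, below is a minimal PyTorch sketch of an FFN augmented in this spirit. It assumes the standard two-layer FFN act(x W1) W2, treats the columns of W1 as "keys" and the rows of W2 as "values", and appends projected visual features as extra key-value pairs, which is algebraically equivalent to concatenating them with the FFN weights. The module and names (MemVPFFN, vis_to_key, vis_to_value) and all dimensions are illustrative assumptions, not the paper's reference implementation, whose exact formulation (e.g., scaling or positional handling) may differ.

```python
import torch
import torch.nn as nn


class MemVPFFN(nn.Module):
    """Sketch of memory-space visual prompting for a transformer FFN.

    A standard FFN computes act(x @ W1) @ W2; viewing W1 as keys and W2 as
    values of a key-value memory, visual features are projected into extra
    key/value rows and read out alongside the original memory.
    """

    def __init__(self, d_model=768, d_ffn=3072, d_vis=1024):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ffn)   # FFN "keys"
        self.fc2 = nn.Linear(d_ffn, d_model)   # FFN "values"
        self.act = nn.GELU()
        # Lightweight projections from frozen vision-encoder features
        # into the FFN's key and value spaces (names are hypothetical).
        self.vis_to_key = nn.Linear(d_vis, d_model)
        self.vis_to_value = nn.Linear(d_vis, d_model)

    def forward(self, x, vis):
        # x:   (batch, seq_len, d_model) text hidden states
        # vis: (batch, n_vis, d_vis)     visual features
        k = self.vis_to_key(vis)    # (batch, n_vis, d_model) extra keys
        v = self.vis_to_value(vis)  # (batch, n_vis, d_model) extra values

        # Original FFN path over the pre-trained memory.
        h = self.act(self.fc1(x))                              # (b, s, d_ffn)

        # Visual path: score text tokens against visual keys, then read
        # out the corresponding visual values. Because the activation is
        # elementwise, this equals running the FFN with [W1 | K] and
        # [W2 ; V], i.e., visual prompts concatenated with the weights.
        h_vis = self.act(torch.einsum("bsd,bnd->bsn", x, k))   # (b, s, n_vis)
        return self.fc2(h) + torch.einsum("bsn,bnd->bsd", h_vis, v)


if __name__ == "__main__":
    ffn = MemVPFFN()
    x = torch.randn(2, 16, 768)     # 16 text tokens
    vis = torch.randn(2, 49, 1024)  # 49 visual patch features
    print(ffn(x, vis).shape)        # torch.Size([2, 16, 768])
```

Note how the text sequence length stays fixed: the visual tokens enlarge the FFN's hidden dimension rather than the language model's input, which is the source of the training- and inference-time savings claimed above.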