Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
arXiv (2024)
Abstract
Adapter-based parameter-efficient transfer learning has achieved exciting
results in vision-language models. Traditional adapter methods require
training or fine-tuning, and thus face challenges such as insufficient samples
or limited computational resources. While some methods avoid training by
caching and retrieving image-modality features, they overlook the text
modality and cross-modal cues for the parameter-efficient adaptation of
vision-language models. This work introduces a cross-modal parameter-efficient
approach named XMAdapter. XMAdapter builds cache models for both the text and
image modalities, then retrieves over the bimodal visual-language information
to gather clues for inference. By dynamically adjusting the affinity ratio, it
achieves cross-modal fusion, decoupling the similarities of the two modalities
to assess their respective contributions. It also mines hard samples based on
differences in cross-modal affinity and improves performance by adaptively
adjusting the learning intensity of those samples. Extensive experiments on
benchmark datasets show that XMAdapter significantly outperforms previous
adapter-based methods in accuracy, generalization, and efficiency.
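The abstract's core mechanism — per-modality cache models whose retrieval affinities are blended by a tunable ratio — can be sketched as follows. This is a minimal illustration assuming a Tip-Adapter-style key-value cache (normalized features as keys, one-hot labels as values); the function name `xmadapter_logits`, the exponential affinity kernel, and the scalar blending weight `alpha` are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def xmadapter_logits(q_img, q_txt, img_keys, txt_keys, values,
                     alpha=0.5, beta=5.0):
    """Blend image- and text-cache affinities into class logits.

    q_img, q_txt       : (d,) L2-normalized query features per modality.
    img_keys, txt_keys : (n, d) L2-normalized cached features.
    values             : (n, c) one-hot labels of the cached samples.
    alpha              : hypothetical affinity ratio weighting the modalities.
    beta               : sharpness of the exponential affinity kernel.
    """
    # Affinity = exponential kernel over cosine similarity,
    # as in Tip-Adapter-style cache models (an assumption here).
    a_img = np.exp(-beta * (1.0 - img_keys @ q_img))
    a_txt = np.exp(-beta * (1.0 - txt_keys @ q_txt))
    # Cross-modal fusion via the (assumed) scalar affinity ratio alpha;
    # the paper adjusts this ratio dynamically rather than fixing it.
    a = alpha * a_img + (1.0 - alpha) * a_txt
    return a @ values  # (c,) cache-based class logits

# Toy cache: 2 samples, 2 classes, 4-dim normalized features.
rng = np.random.default_rng(0)
def unit(v):
    return v / np.linalg.norm(v)

img_keys = np.stack([unit(rng.normal(size=4)) for _ in range(2)])
txt_keys = np.stack([unit(rng.normal(size=4)) for _ in range(2)])
values = np.eye(2)  # cached sample i belongs to class i

# Querying with sample 0's own features should favor class 0.
logits = xmadapter_logits(img_keys[0], txt_keys[0],
                          img_keys, txt_keys, values)
```

In this sketch the per-modality affinities `a_img` and `a_txt` stay decoupled until the final blend, which mirrors the abstract's point that separating the modal similarities lets one assess each modality's contribution; hard-sample mining (reweighting samples where the two affinities disagree) is omitted for brevity.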