LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
CoRR (2024)
Abstract
Large Multimodal Models (LMMs) have shown significant reasoning capabilities
by connecting a visual encoder and a large language model. LMMs typically take
a fixed number of visual tokens, such as the penultimate-layer features of the
CLIP visual encoder, as their prefix content. Recent LMMs incorporate more
complex visual inputs, such as high-resolution images and videos, which
increase the number of visual tokens significantly. However, due to the design
of the Transformer architecture, computational costs associated with these
models tend to increase quadratically with the number of input tokens. To
tackle this problem, we explore a token reduction mechanism and find, similar
to prior work, that many visual tokens are spatially redundant. Based on this,
we propose PruMerge, a novel adaptive visual token reduction approach that
substantially reduces the number of visual tokens while maintaining comparable
model performance. We first select the visual tokens to keep (the unpruned
tokens) based on the similarity between the class token and the spatial tokens.
We then cluster the pruned tokens by key similarity and merge each cluster with
its unpruned token to supplement the information the kept tokens carry.
Empirically, when applied to LLaVA-1.5,
our approach can compress the visual tokens by 14.4 times on average, and
achieve comparable performance across diverse visual question-answering and
reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.
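To make the selection-and-merge procedure concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it. The details here are illustrative assumptions rather than the authors' implementation: the function name prumerge_sketch, the interquartile-range (IQR) outlier rule used for adaptive selection, and the fixed knn cluster size are stand-ins; the released code at the project page above is authoritative.

# Minimal sketch of the PruMerge idea, assuming access to the visual
# encoder's last-layer class-token query and spatial-token keys.
# The IQR selection rule and k-NN merge below are assumptions for
# illustration, not the authors' exact implementation.
import torch
import torch.nn.functional as F

def prumerge_sketch(tokens, cls_query, keys, knn=8):
    # tokens:    (N, D)  spatial visual tokens (e.g., CLIP penultimate layer)
    # cls_query: (Dk,)   class-token query from the last attention layer
    # keys:      (N, Dk) spatial-token keys from the same layer
    d = keys.shape[-1]
    # 1) Importance = attention of the class token to each spatial token.
    attn = F.softmax(cls_query @ keys.T / d ** 0.5, dim=-1)        # (N,)

    # 2) Adaptive selection: keep tokens whose attention is an outlier
    #    under the IQR rule, so the kept count varies per image.
    q1, q3 = torch.quantile(attn, 0.25), torch.quantile(attn, 0.75)
    keep = attn > q3 + 1.5 * (q3 - q1)
    if keep.sum() == 0:                                            # degenerate fallback
        keep = attn >= attn.topk(1).values.min()
    keep_idx = keep.nonzero(as_tuple=True)[0]
    prune_idx = (~keep).nonzero(as_tuple=True)[0]

    # 3) Cluster pruned tokens around each kept token by key similarity,
    #    then merge each cluster into its kept token with attention weights.
    sim = F.normalize(keys[keep_idx], dim=-1) @ F.normalize(keys[prune_idx], dim=-1).T
    merged = []
    for i, ki in enumerate(keep_idx):
        k = min(knn, prune_idx.numel())
        if k == 0:
            merged.append(tokens[ki])
            continue
        cluster = prune_idx[sim[i].topk(k).indices]                # nearest pruned tokens
        idx = torch.cat([ki.view(1), cluster])
        w = attn[idx] / attn[idx].sum()                            # weights from CLS attention
        merged.append((w.unsqueeze(-1) * tokens[idx]).sum(dim=0))
    return torch.stack(merged)                                      # (N_kept, D)

For scale: LLaVA-1.5's CLIP-ViT-L/14 encoder at 336px resolution yields 576 spatial tokens, so the reported 14.4x average compression corresponds to roughly 40 retained tokens per image, with the adaptive rule letting that count vary by image content.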