IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact
arXiv (2024)
Abstract
Large language models (LLMs) excel in natural language processing but demand
intensive computation. To mitigate this, various quantization methods have been
explored, yet they compromise LLM performance. This paper unveils a previously
overlooked type of outlier in LLMs. These outliers are found to concentrate
most of the attention scores on the initial tokens of the input, termed pivot
tokens, which are crucial to the performance of quantized LLMs. Motivated by
this, we propose
IntactKV to generate the KV cache of pivot tokens losslessly from the
full-precision model. The approach is simple and easy to combine with existing
quantization solutions. Moreover, IntactKV can be calibrated as additional LLM
parameters to further boost the quantized LLMs. Mathematical analysis also
proves that IntactKV effectively reduces the upper bound of quantization error.
Empirical results show that IntactKV brings consistent improvements and
achieves lossless weight-only INT4 quantization on various downstream tasks,
setting a new state of the art for LLM quantization.
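
To make the mechanism concrete, here is a minimal PyTorch sketch of the core
idea, not the paper's implementation: the pivot tokens' KV cache is computed
once with the full-precision model and prepended to the quantized model's
forward pass. The gpt2 checkpoint, the choice of n_pivot, and the helper
fake_quantize_int4 (a naive round-to-nearest INT4 quantizer) are all
illustrative assumptions standing in for the real quantization backends the
paper pairs with.

```python
# Minimal sketch of the IntactKV idea (an illustration, not the paper's code).
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.pytorch_utils import Conv1D


def fake_quantize_int4(model):
    """Naive per-row round-to-nearest INT4 weight quantization, standing in
    for the real weight-only quantizers (e.g. RTN, GPTQ, AWQ)."""
    for name, m in model.named_modules():
        # Keep the (tied) lm_head/embedding in full precision, as is common.
        if isinstance(m, (torch.nn.Linear, Conv1D)) and "lm_head" not in name:
            w = m.weight.data
            scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
            m.weight.data = (w / scale).round().clamp(-8, 7) * scale
    return model


model_id = "gpt2"  # placeholder checkpoint; the paper targets LLaMA-scale LLMs
tok = AutoTokenizer.from_pretrained(model_id)
fp_model = AutoModelForCausalLM.from_pretrained(model_id).eval()
quant_model = fake_quantize_int4(copy.deepcopy(fp_model)).eval()

ids = tok("Quantization keeps pivot tokens intact.", return_tensors="pt").input_ids
n_pivot = 1  # treat the first token as the pivot ([BOS] for LLaMA-style models)

with torch.no_grad():
    # Reference logits from the full-precision model (non-pivot positions).
    fp_logits = fp_model(ids).logits[:, n_pivot:]

    # 1) Generate the pivot tokens' KV cache losslessly with the FP model.
    intact_kv = fp_model(ids[:, :n_pivot], use_cache=True).past_key_values

    # 2) Run the quantized model on the remaining tokens, prepending the
    #    intact cache instead of recomputing it from quantized weights.
    #    (The paper further calibrates this cache as extra trainable
    #    parameters; that step is omitted here.)
    logits_intact = quant_model(ids[:, n_pivot:], past_key_values=intact_kv).logits

    # Baseline: the quantized model computes everything, pivot tokens included.
    logits_plain = quant_model(ids).logits[:, n_pivot:]

print("mean |logit gap| to FP with IntactKV:", (logits_intact - fp_logits).abs().mean().item())
print("mean |logit gap| to FP, plain quant :", (logits_plain - fp_logits).abs().mean().item())
```

Since only the first few tokens are affected, the intact cache is computed
once and adds negligible memory or compute; as the abstract notes, the paper
additionally treats these cached keys and values as extra trainable parameters
and calibrates them to further close the gap to the full-precision model.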