
CBQ: Cross-Block Quantization for Large Language Models

arXiv (Cornell University), 2023

Abstract
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
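The abstract centers on a cross-block reconstruction objective: quantization parameters are tuned jointly over a group of consecutive blocks against the full-precision outputs of that group, rather than block by block. The sketch below illustrates only this general idea, not the authors' implementation; the toy block structure, 4-bit symmetric fake quantization, straight-through estimator, and the names fake_quantize, QuantLinear, and cross_block_reconstruct are illustrative assumptions.

```python
# Minimal sketch of cross-block reconstruction for PTQ (illustrative only).
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, scale: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a learnable per-channel scale."""
    qmax = 2 ** (bits - 1) - 1
    x = w / scale
    # Straight-through estimator: round in the forward pass, identity gradient.
    x_q = torch.clamp((x.round() - x).detach() + x, -qmax - 1, qmax)
    return x_q * scale  # dequantize back to float to simulate low-bit weights


class QuantLinear(nn.Module):
    """Wraps a float linear layer and simulates low-bit weights at inference."""

    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        self.weight = linear.weight.detach()
        self.bias = None if linear.bias is None else linear.bias.detach()
        # One scale per output channel, initialised from the weight range.
        init = self.weight.abs().amax(dim=1, keepdim=True) / (2 ** (bits - 1) - 1)
        self.scale = nn.Parameter(init.clone())
        self.bits = bits

    def forward(self, x):
        w_q = fake_quantize(self.weight, self.scale, self.bits)
        return nn.functional.linear(x, w_q, self.bias)


def cross_block_reconstruct(fp_blocks, q_blocks, calib_inputs, steps=200, lr=1e-3):
    """Jointly tune the quantization scales of SEVERAL consecutive blocks so the
    output of the whole group matches the full-precision output, instead of
    reconstructing each block in isolation."""
    params = [p for blk in q_blocks for p in blk.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = x
                for blk in fp_blocks:      # full-precision reference path
                    target = blk(target)
            out = x
            for blk in q_blocks:           # quantized path over the same group
                out = blk(out)
            loss = nn.functional.mse_loss(out, target)
            opt.zero_grad()
            loss.backward()
            opt.step()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy "blocks": two linear layers standing in for transformer blocks.
    fp_blocks = [nn.Linear(64, 64) for _ in range(2)]
    q_blocks = [QuantLinear(b) for b in fp_blocks]
    calib = [torch.randn(8, 64) for _ in range(4)]
    cross_block_reconstruct(fp_blocks, q_blocks, calib, steps=20)
```

Optimizing the group as a whole is what allows quantization error introduced in an earlier block to be compensated by the scales of later blocks, which is the dependency that per-block reconstruction ignores.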
Keywords
Language Modeling, Statistical Language Modeling, Topic Modeling