
CBQ: Cross-Block Quantization for Large Language Models

arXiv (Cornell University), 2023

Abstract
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
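The abstract centers on a cross-block reconstruction objective: quantization parameters are tuned jointly over a group of consecutive blocks against the full-precision outputs of that group, rather than block by block. The sketch below illustrates only this general idea, not the authors' implementation; the toy block structure, 4-bit symmetric fake quantization, straight-through estimator, and the names fake_quantize, QuantLinear, and cross_block_reconstruct are illustrative assumptions.

```python
# Minimal sketch of cross-block reconstruction for PTQ (illustrative only).
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, scale: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric fake quantization with a learnable per-channel scale."""
    qmax = 2 ** (bits - 1) - 1
    x = w / scale
    # Straight-through estimator: round in the forward pass, identity gradient.
    x_q = torch.clamp((x.round() - x).detach() + x, -qmax - 1, qmax)
    return x_q * scale  # dequantize back to float to simulate low-bit weights


class QuantLinear(nn.Module):
    """Wraps a float linear layer and simulates low-bit weights at inference."""

    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        self.weight = linear.weight.detach()
        self.bias = None if linear.bias is None else linear.bias.detach()
        # One scale per output channel, initialised from the weight range.
        init = self.weight.abs().amax(dim=1, keepdim=True) / (2 ** (bits - 1) - 1)
        self.scale = nn.Parameter(init.clone())
        self.bits = bits

    def forward(self, x):
        w_q = fake_quantize(self.weight, self.scale, self.bits)
        return nn.functional.linear(x, w_q, self.bias)


def cross_block_reconstruct(fp_blocks, q_blocks, calib_inputs, steps=200, lr=1e-3):
    """Jointly tune the quantization scales of SEVERAL consecutive blocks so the
    output of the whole group matches the full-precision output, instead of
    reconstructing each block in isolation."""
    params = [p for blk in q_blocks for p in blk.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = x
                for blk in fp_blocks:      # full-precision reference path
                    target = blk(target)
            out = x
            for blk in q_blocks:           # quantized path over the same group
                out = blk(out)
            loss = nn.functional.mse_loss(out, target)
            opt.zero_grad()
            loss.backward()
            opt.step()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy "blocks": two linear layers standing in for transformer blocks.
    fp_blocks = [nn.Linear(64, 64) for _ in range(2)]
    q_blocks = [QuantLinear(b) for b in fp_blocks]
    calib = [torch.randn(8, 64) for _ in range(4)]
    cross_block_reconstruct(fp_blocks, q_blocks, calib, steps=20)
```

Optimizing the group as a whole is what allows quantization error introduced in an earlier block to be compensated by the scales of later blocks, which is the dependency that per-block reconstruction ignores.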
Keywords
Language Modeling, Statistical Language Modeling, Topic Modeling