MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition.

International Symposium on Computer Architecture (2024)

Abstract
Large language models (LLMs) have shown surprising performance on language tasks, driving a new trend of deploying LLMs from the cloud to the edge. However, as a scaled-up auto-regressive Transformer with a huge number of parameters that generates output tokens one by one, an LLM incurs overwhelming memory footprints and computation during inference, especially in its linear layers. For example, generating 32 output tokens with the LLaMA-7B LLM requires 14 GB of weight data and over 400 billion operations (98% from linear layers), far beyond the capability of consumer-level GPUs and traditional accelerators. To address these issues, we propose MECLA, a memory-compute-efficient LLM accelerator built on a parameter-efficient scaling sub-matrix partition method (SSMP). SSMP decomposes large weight matrices into several tiny source sub-matrices (SS) and derived sub-matrices (DS), where each DS is obtained by scaling the corresponding SS with a scalar. On the memory side, SSMP avoids accessing the full weight matrix and instead requires only the small SS and the DS scaling scalars. On the computation side, the MECLA processor fully exploits intermediate data reuse in matrix multiplication via on-chip matrix regrouping, inner-product multiplication re-association, and outer-product partial-sum reuse. Experiments on 20 benchmarks show that MECLA reduces memory access and computation by 83.6% and 72.2%, respectively, and achieves an energy efficiency of 7088 GOPS/W. Compared with the V100 GPU and the state-of-the-art Transformer accelerators SpAtten and FACT, MECLA delivers 113.14×, 12.99×, and 1.62× higher energy efficiency, respectively.
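Below is a minimal NumPy sketch of the scaling sub-matrix partition idea described in the abstract, assuming each derived sub-matrix (DS) is approximated as a per-tile scalar times a shared source sub-matrix (SS). The function names, the use of a single shared SS, and the least-squares scalar fit are illustrative assumptions, not the paper's actual SSMP algorithm or hardware dataflow.

```python
import numpy as np

def ssmp_compress(W, tile):
    """Tile W into (tile x tile) blocks and represent each block (DS) as
    alpha_ij * S, where S is one shared source sub-matrix (SS).
    This is a hypothetical least-squares fit, not the paper's method."""
    rows, cols = W.shape
    assert rows % tile == 0 and cols % tile == 0
    blocks = (W.reshape(rows // tile, tile, cols // tile, tile)
               .transpose(0, 2, 1, 3))            # (n_row_tiles, n_col_tiles, tile, tile)
    S = blocks.mean(axis=(0, 1))                  # shared source sub-matrix (SS)
    alphas = np.einsum('ijab,ab->ij', blocks, S) / np.sum(S * S)  # per-tile scalars
    return S, alphas

def ssmp_reconstruct(S, alphas, tile):
    """Rebuild the approximate weight matrix: each DS = alpha * SS."""
    n, m = alphas.shape
    blocks = alphas[:, :, None, None] * S
    return blocks.transpose(0, 2, 1, 3).reshape(n * tile, m * tile)

# Storage drops from rows*cols weights to tile*tile weights plus
# (rows/tile)*(cols/tile) scalars, mirroring SSMP's memory-access saving.
W = np.random.randn(64, 64).astype(np.float32)
S, alphas = ssmp_compress(W, tile=8)
W_hat = ssmp_reconstruct(S, alphas, tile=8)
print(S.shape, alphas.shape, W_hat.shape)  # (8, 8) (8, 8) (64, 64)
```

In this sketch the full 64x64 matrix (4096 weights) is replaced by an 8x8 SS plus 64 scalars; the actual partition sizes, how the SS/DS pairs are chosen, and the reuse of partial sums in hardware follow the paper's SSMP method rather than this toy fit.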
Keywords
large language model,transformer,accelerator,artificial intelligence