AI and Memory Wall
IEEE Micro (2024)
Abstract
The availability of unprecedented amounts of unsupervised training data, along
with neural scaling laws, has driven a surge in model size and in the
compute requirements for training and serving LLMs. However, the main performance
bottleneck is increasingly shifting to memory bandwidth. Over the past 20
years, peak server hardware FLOPS has scaled at 3.0x every 2 years, outpacing
the growth of DRAM and interconnect bandwidth, which have scaled at only 1.6x
and 1.4x every 2 years, respectively. This disparity has made memory, rather
than compute, the primary bottleneck in AI applications, particularly in
serving. Here, we analyze encoder and decoder Transformer models and show how
memory bandwidth can become the dominant bottleneck for decoder models. We
argue for a redesign in model architecture, training, and deployment strategies
to overcome this memory limitation.
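
To make the scaling gap concrete, the per-2-year rates quoted above can be compounded over the 20-year window. The sketch below is only an illustration: it assumes each rate held constant for the whole period, and the helper name `compound_growth` is hypothetical rather than taken from the paper.

```python
# Illustrative arithmetic only: compounds the per-2-year scaling rates quoted
# in the abstract over 20 years, assuming each rate stayed constant throughout.

def compound_growth(rate_per_2yrs: float, years: float) -> float:
    """Total growth factor after `years` at `rate_per_2yrs` per 2-year period."""
    return rate_per_2yrs ** (years / 2.0)

YEARS = 20  # "over the past 20 years" in the abstract

flops = compound_growth(3.0, YEARS)         # peak server hardware FLOPS
dram = compound_growth(1.6, YEARS)          # DRAM bandwidth
interconnect = compound_growth(1.4, YEARS)  # interconnect bandwidth

print(f"FLOPS growth:            {flops:,.0f}x")
print(f"DRAM bandwidth growth:   {dram:,.1f}x")
print(f"Interconnect growth:     {interconnect:,.1f}x")
print(f"FLOPS / DRAM gap:        {flops / dram:,.0f}x")
print(f"FLOPS / interconnect gap: {flops / interconnect:,.0f}x")
```

Under this constant-rate assumption, compute grows by roughly 59,000x while DRAM bandwidth grows by about 110x and interconnect bandwidth by about 29x, so the compute-to-DRAM-bandwidth ratio widens by a factor of several hundred.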