Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

arXiv (2021)

Abstract
Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by more than 10% with a memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).
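The headline numbers can be cross-checked with simple arithmetic. The sketch below is not from the paper's code; it assumes a tensor-parallel size of 8, a pipeline-parallel size of 64, and a per-GPU theoretical peak of 312 teraFLOP/s (NVIDIA A100 half-precision tensor cores) to show how a 3072-GPU run at an aggregate 502 petaFLOP/s corresponds to roughly 52% of per-GPU peak.

```python
# Minimal sketch (assumptions, not the paper's code):
# the three parallelism degrees multiply to the total GPU count,
# and aggregate throughput divided by GPU count gives per-GPU throughput.

tensor_parallel = 8      # assumed: GPUs splitting each layer's matrices
pipeline_parallel = 64   # assumed: number of pipeline stages
data_parallel = 6        # assumed: replicas of each model-parallel group
total_gpus = tensor_parallel * pipeline_parallel * data_parallel
assert total_gpus == 3072

aggregate_petaflops = 502.0        # reported aggregate training throughput
peak_teraflops_per_gpu = 312.0     # assumed A100 tensor-core peak

per_gpu_teraflops = aggregate_petaflops * 1e3 / total_gpus
fraction_of_peak = per_gpu_teraflops / peak_teraflops_per_gpu

print(f"per-GPU throughput: {per_gpu_teraflops:.1f} teraFLOP/s "
      f"({fraction_of_peak:.0%} of assumed peak)")
# -> per-GPU throughput: 163.4 teraFLOP/s (52% of assumed peak)
```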
Keywords
clusters, language, training, large-scale