Scaling Distributed Machine Learning With In-Network Aggregation

PROCEEDINGS OF THE 18TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION (NSDI 2021)

Abstract
Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide an efficient solution that speeds up training by up to 5.5× for a number of real-world benchmark models.
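As a rough illustration of the aggregation primitive the abstract describes, the sketch below simulates a switch that sums fixed-size chunks of gradient updates from several workers and returns only the aggregate, so each worker sends and receives a single stream instead of exchanging full gradients with every peer. This is a minimal host-side simulation of the idea, not the SwitchML dataplane or its protocol; the names SwitchSim, CHUNK, NUM_WORKERS, and MODEL_SIZE are illustrative assumptions.

```python
import numpy as np

# Toy model of in-network aggregation (conceptual sketch, NOT SwitchML's implementation).
CHUNK = 4          # illustrative chunk size; real switches aggregate small packets
NUM_WORKERS = 3
MODEL_SIZE = 12

class SwitchSim:
    """Accumulates one chunk from every worker, then releases the sum."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.buffer = np.zeros(CHUNK)
        self.received = 0

    def submit(self, chunk):
        self.buffer += chunk
        self.received += 1
        if self.received == self.num_workers:
            result = self.buffer.copy()
            self.buffer = np.zeros(CHUNK)
            self.received = 0
            return result          # aggregate "broadcast" back to all workers
        return None

# Each worker holds a local gradient; the aggregate should equal their sum.
gradients = [np.random.randn(MODEL_SIZE) for _ in range(NUM_WORKERS)]
switch = SwitchSim(NUM_WORKERS)
aggregated = np.zeros(MODEL_SIZE)

for start in range(0, MODEL_SIZE, CHUNK):
    for g in gradients:
        out = switch.submit(g[start:start + CHUNK])
    aggregated[start:start + CHUNK] = out   # the last submit completes the chunk

assert np.allclose(aggregated, sum(gradients))
print("in-network aggregate matches the sum of worker gradients")
```

In this toy setup each worker transmits its gradient once and receives only the summed result, which is the data-volume reduction the abstract attributes to aggregating model updates inside the network.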