LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models
arXiv (2021)
Abstract
As model sizes in machine learning continue to scale, distributed training is
necessary to accommodate model weights within each device and to reduce
training time. However, this comes with the expense of increased communication
overhead due to the exchange of gradients and activations, which become the
critical bottleneck of the end-to-end training process. In this work, we
motivate the design of multi-dimensional networks within machine learning
systems as a cost-efficient mechanism to enhance overall network bandwidth. We
also identify that optimal bandwidth allocation is pivotal for
multi-dimensional networks to ensure efficient resource utilization. We
introduce LIBRA, a framework specifically focused on optimizing
multi-dimensional fabric architectures. Through case studies, we demonstrate
the value of LIBRA, both in architecting optimized fabrics under diverse
constraints and in enabling co-optimization opportunities.
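To make the bandwidth-allocation point concrete, below is a minimal Python sketch of workload-aware bandwidth allocation across the dimensions of a multi-dimensional fabric. This is not LIBRA's actual algorithm; the function name, the traffic figures, and the closed-form proportional rule are all illustrative assumptions.

```python
# Sketch: split a total bandwidth budget across the dimensions of a
# multi-dimensional training network. Illustrative only -- not LIBRA's
# actual optimization; traffic values and the proportional rule are
# assumptions for demonstration.

def allocate_bandwidth(traffic_gb, total_bw_gbps):
    """Allocate a bandwidth budget across network dimensions.

    Minimizing the slowest dimension's transfer time,
    max_i(traffic_i / bw_i) subject to sum_i(bw_i) = budget,
    is achieved by bandwidth proportional to each dimension's traffic.
    """
    total_traffic = sum(traffic_gb)
    return [total_bw_gbps * t / total_traffic for t in traffic_gb]

# Hypothetical traffic per dimension (GB exchanged per training step),
# e.g. dim 0 = intra-node scale-up, dim 1 = rail, dim 2 = scale-out.
traffic = [40.0, 10.0, 2.0]
budget = 500.0  # total Gb/s permitted by the fabric cost model

for dim, bw in enumerate(allocate_bandwidth(traffic, budget)):
    print(f"dim {dim}: {bw:.1f} Gb/s")
```

Under these assumptions, the heavily loaded inner dimension receives most of the budget, illustrating why a uniform split across dimensions would waste bandwidth on lightly loaded links.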