Interference-aware opportunistic job placement for shared distributed deep learning clusters

Journal of Parallel and Distributed Computing (2024)

Abstract
Distributed deep learning frameworks facilitate large deep learning workloads. These frameworks support sharing one GPU device among multiple jobs to improve resource utilization. Modern deep learning training jobs consume a large amount of GPU memory. Even so, sharing GPU memory among jobs is still possible, because a training job proceeds in iterative steps and its memory usage fluctuates over time. However, resource sharing also introduces the risk of job performance degradation. Co-located jobs sharing a GPU device may suffer from different levels of interference, mainly caused by memory oversharing. How to improve resource utilization while maintaining good job performance is a novel challenge for job placement strategies. This paper studies the job placement problem. We propose an opportunistic memory sharing model to describe time-varying job memory requirements. Based on this model, we introduce an Opportunistic Job Placement Problem (OJPP) for shared GPU clusters, which seeks job placement configurations that use a minimum number of GPU devices while guaranteeing user-defined performance requirements. To solve the problem, we propose a greedy algorithm and a heuristic algorithm with computational complexities of O(n log n) and O(n^2 log n), respectively. We also propose an online adjustment algorithm with a computational complexity of O(n log n) that updates job placement configurations at runtime. A machine-learning-based interference prediction method is used to produce accurate interference estimates. Extensive experiments are conducted on a GPU cluster to verify the correctness and effectiveness of our algorithms. Compared with standalone training jobs on dedicated clusters, the proposed approach reduces resource consumption by 46% in a shared cluster while guaranteeing over 92.97% of the job performance, measured by average job completion time.
Keywords
Deep learning cluster, Interference, Online adjustment, Opportunistic sharing, Job placement
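
The abstract does not spell out the greedy algorithm, so the following is only an illustrative sketch of the general idea of interference-aware placement, not the paper's method. It uses a plain first-fit-decreasing packing and makes no claim to match the stated O(n log n) complexity. All names (`Job`, `Gpu`, `place`, `peak_mem_gb`, `slowdown`, `max_slowdown`) are hypothetical. It also simplifies heavily: each job is described by a single peak memory figure rather than a time-varying profile, predicted slowdowns of co-located jobs are assumed to add up, and every job is assumed to fit on an empty GPU.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    peak_mem_gb: float  # assumption: peak GPU memory over one training iteration
    slowdown: float     # assumption: predicted fractional slowdown when co-located


@dataclass
class Gpu:
    capacity_gb: float
    jobs: list = field(default_factory=list)

    def fits(self, job: Job, max_slowdown: float) -> bool:
        """Check the memory budget and the (assumed additive) interference budget."""
        mem = sum(j.peak_mem_gb for j in self.jobs) + job.peak_mem_gb
        slow = sum(j.slowdown for j in self.jobs) + job.slowdown
        # A job running alone on a GPU suffers no co-location interference.
        return mem <= self.capacity_gb and (not self.jobs or slow <= max_slowdown)


def place(jobs, capacity_gb: float, max_slowdown: float):
    """First-fit decreasing: largest jobs first, open a new GPU only when needed."""
    gpus = []
    for job in sorted(jobs, key=lambda j: j.peak_mem_gb, reverse=True):
        target = next((g for g in gpus if g.fits(job, max_slowdown)), None)
        if target is None:
            target = Gpu(capacity_gb)  # assumes the job fits on an empty GPU
            gpus.append(target)
        target.jobs.append(job)
    return gpus


if __name__ == "__main__":
    demo = [Job("A", 10, 0.03), Job("B", 6, 0.03), Job("C", 5, 0.03), Job("D", 4, 0.03)]
    for i, gpu in enumerate(place(demo, capacity_gb=16, max_slowdown=0.07)):
        print(i, [j.name for j in gpu.jobs])
```

Tightening `max_slowdown` in this sketch forces more jobs onto their own devices, which mirrors the trade-off the abstract describes between GPU count and guaranteed job performance.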