Hydra: Deadline-Aware and Efficiency-Oriented Scheduling for Deep Learning Jobs on Heterogeneous GPUs

IEEE Transactions on Computers (2023)

Abstract
With the rapid proliferation of deep learning (DL) jobs running on heterogeneous GPUs, scheduling DL jobs to meet various requirements, such as meeting deadlines and reducing job completion time (JCT), is critical. Unfortunately, existing efficiency-oriented and deadline-aware efforts remain rudimentary: they cannot schedule jobs to meet deadline requirements while also reducing total JCT, especially when jobs have widely varying execution times on heterogeneous GPUs. We therefore present Hydra, a novel quantitative cost-comparison approach to this scheduling problem. Here, the cost is the total JCT plus a dynamic penalty computed from the total tardiness (i.e., the delay beyond the deadline) of all jobs. Hydra adopts a sampling approach that exploits the inherent iterative periodicity of DL jobs to accurately estimate job execution times on heterogeneous GPUs. Hydra then explores combinations of job sequences and GPU assignments with an efficient branch-and-bound algorithm to find the minimum-cost schedule. Finally, evaluation on Alibaba traces shows that Hydra reduces total tardiness by 85.8% compared with state-of-the-art efforts, while also reducing total JCT as much as possible.
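The cost metric and search described above (total JCT plus a tardiness penalty, minimized by branch and bound over job orders and GPU assignments) can be sketched roughly as follows. The penalty weight, data layout, and pruning rule here are illustrative assumptions, not Hydra's actual formulation:

```python
def branch_and_bound(jobs, gpus, exec_time, deadline, penalty=2.0):
    """Find a job order and GPU assignment minimizing
    sum(JCT) + penalty * sum(tardiness).

    exec_time[job][gpu] -- estimated run time of `job` on `gpu`
                           (Hydra obtains these via sampling)
    deadline[job]       -- the job's deadline
    penalty             -- illustrative tardiness weight
    """
    best = [float("inf"), None]  # [best cost, best plan]

    def recurse(remaining, finish, cost, plan):
        # Bound: every cost term is nonnegative, so the partial cost
        # is a valid lower bound; prune if it already exceeds the best.
        if cost >= best[0]:
            return
        if not remaining:
            best[0], best[1] = cost, plan
            return
        # Branch: try scheduling each remaining job next, on each GPU.
        for job in remaining:
            for g in gpus:
                jct = finish.get(g, 0.0) + exec_time[job][g]
                tardiness = max(0.0, jct - deadline[job])
                new_finish = dict(finish)
                new_finish[g] = jct  # jobs on a GPU run back-to-back
                recurse(remaining - {job}, new_finish,
                        cost + jct + penalty * tardiness,
                        plan + [(job, g)])

    recurse(frozenset(jobs), {}, 0.0, [])
    return best[0], best[1]
```

For example, with two jobs whose execution times differ across two GPUs, the search places each job on its faster GPU when that also satisfies the deadlines. The pruning is safe because a partial schedule's accumulated cost can only grow as more jobs are appended.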
Keywords
Deadline-aware scheduler, deep learning, GPU cluster