A QoS-oriented Scheduling and Autoscaling Framework for Deep Learning

2019 International Joint Conference on Neural Networks (IJCNN) (2019)

Abstract
Deep learning is popular in many areas, but users must manually specify the resource configuration when submitting deep learning training jobs, and they usually over-provision resources. Such manual configuration results in slow training and low resource utilization. It would be more convenient and efficient if users only needed to specify the quality of service (QoS) for their jobs and the resources were then configured automatically to meet that QoS. To satisfy this demand, we present a QoS-oriented scheduling and autoscaling framework that schedules and autoscales deep learning training jobs in a Kubernetes cluster. This paper focuses on the most important QoS requirement for deep learning training jobs: the deadline. The goal of the framework is to guarantee that as many jobs as possible are completed before their specified deadlines. To reach this goal, the framework schedules deep learning jobs with a heuristic scheduling policy based on resource status and job deadlines, and it autoscales the resource configuration by exploiting a characteristic of deep learning jobs: the predictability of training time. This predictability is used to predict whether a job can be completed before its deadline and, if necessary, to estimate an appropriate resource configuration. We implemented the framework by modifying the default scheduler of Kubernetes and conducted experiments to evaluate its performance. The experimental results show that our scheduling policy can improve the completion rate by 26% when cluster resources are insufficient, and that our autoscaling policy can improve the completion rate to 100% when cluster resources are sufficient. We also show that the framework improves the utilization of allocated CPUs to 100%. Our proposed framework points to a new way of submitting and managing deep learning training jobs in a cluster.
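The abstract does not give the prediction model, so the following is only a minimal sketch of how a deadline check and a resource estimate of this kind might look. It assumes a simple linear-scaling model in which the time per training step shrinks in proportion to the number of CPUs. The function names, parameters, and the scaling model itself are hypothetical illustrations and are not taken from the paper.

```python
import math
from typing import Optional


def predicted_remaining_seconds(remaining_steps: int,
                                secs_per_step_one_cpu: float,
                                cpus: int) -> float:
    """Predict remaining training time under an ASSUMED linear-scaling model.

    Assumption (not from the paper): per-step time is inversely
    proportional to the number of allocated CPUs.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return remaining_steps * secs_per_step_one_cpu / cpus


def can_meet_deadline(remaining_steps: int,
                      secs_per_step_one_cpu: float,
                      cpus: int,
                      seconds_until_deadline: float) -> bool:
    """Return True if the predicted finish time is within the deadline."""
    predicted = predicted_remaining_seconds(
        remaining_steps, secs_per_step_one_cpu, cpus)
    return predicted <= seconds_until_deadline


def min_cpus_for_deadline(remaining_steps: int,
                          secs_per_step_one_cpu: float,
                          seconds_until_deadline: float,
                          max_cpus: Optional[int] = None) -> Optional[int]:
    """Estimate the smallest CPU count that meets the deadline.

    Returns None when the deadline has already passed or when the
    estimate exceeds the hypothetical per-job limit `max_cpus`.
    """
    if seconds_until_deadline <= 0:
        return None
    total_work = remaining_steps * secs_per_step_one_cpu
    needed = max(1, math.ceil(total_work / seconds_until_deadline))
    if max_cpus is not None and needed > max_cpus:
        return None
    return needed
```

Under these assumptions, a job with 1000 remaining steps at 2.0 seconds per step on one CPU needs 500 seconds on 4 CPUs, so it meets a 600-second deadline. If the deadline were 400 seconds away, the sketch would suggest scaling to 5 CPUs. A real predictor would have to account for sub-linear scaling and measurement noise.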
Keywords
QoS-oriented scheduling, autoscaling framework, cluster resources, resource configuration, resource utilization, deep learning training jobs, quality of service, Kubernetes cluster