Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Qiyang Ding, Pengfei Zheng, Shreyas Kudari, Shivaram Venkataraman, Zhao Zhang

SC23: International Conference for High Performance Computing, Networking, Storage and Analysis (2023)

Abstract
Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate the issues caused by interruption, we investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, XGBoost, Deep Q-Network, and policy gradient, to design a proactive provisioner using production job traces from three GPU clusters. We follow standard machine learning practice by partitioning each job trace into training and validation subsets, then train each model on the training subset and evaluate its generality on the validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate RL methods. Our experiments show that Mirage can reduce interruption by 17%-100% and safeguard 23%-76% of jobs with zero interruption across varying load levels on the three clusters.
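The trace partitioning and the notion of interruption described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `JobRecord` fields, the chronological 80/20 split, and the definition of interruption as the idle gap between one job ending and its successor starting are not taken from the paper, and the sketch is not Mirage's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JobRecord:
    """One row of a hypothetical job trace (field names are illustrative)."""
    submit_time: float  # seconds since the start of the trace
    wait_time: float    # seconds spent queued before the job starts
    run_time: float     # seconds of execution


def split_trace(
    trace: List[JobRecord], train_fraction: float = 0.8
) -> Tuple[List[JobRecord], List[JobRecord]]:
    """Split a trace into training and validation subsets.

    Assumption: the split is chronological (earlier jobs train, later
    jobs validate); the abstract does not say how the split is made.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(trace, key=lambda job: job.submit_time)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def interruption(
    predecessor_end: float, successor_submit: float, successor_wait: float
) -> float:
    """Idle gap between a job ending and its successor starting.

    Assumed definition: the successor starts at submit time plus queue
    wait; a successor that is ready before the predecessor ends yields 0.
    """
    successor_start = successor_submit + successor_wait
    return max(0.0, successor_start - predecessor_end)
```

Under these assumptions, a successor submitted reactively when the predecessor ends at t = 100 s and then queued for 30 s gives `interruption(100.0, 100.0, 30.0) == 30.0`, whereas submitting it proactively at t = 60 s with the same wait gives `interruption(100.0, 60.0, 30.0) == 0.0`. Choosing that earlier submission time is the decision a proactive provisioner would have to learn.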
Keywords
GPU Cluster, Deep Learning, Random Forest, XGBoost, Policy Gradient, Deep Q-network, Wall-clock Time, Reinforcement Learning Techniques, Deep Learning Training, Deep Learning Research, Deep Network, Deep Neural Network, Transition State, State Space, Ensemble Method, Language Model, Waiting Time, Transformer Model, Policy Learning, Gradient Boosting Decision Tree, Ensemble Learning Method, Foundation Model, Job Completion Time, Policy Network, Reinforcement Learning Methods, Policy Gradient Method, Deep Q-learning, Medium Load, Single Job, Light Load