Scheduling ML training on unreliable spot instances.

UCC Companion(2021)

引用 1|浏览13
暂无评分
摘要
Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable. When demand surges for higher priced on-demand instances, cloud providers can interrupt these spot instances after a brief alert. Such unreliability makes it challenging to utilize spot instances for many long-running jobs. However, with checkpoints and restoration, machine-learning (ML) training jobs are a good candidate to overcome this difficulty. In this paper, we formalize the problem of scheduling ML-training jobs on transient spot instances, especially from an ML researcher's view, who may have some grant/credit for renting cloud computing services for several ML training tasks. Such a researcher would need to partition the computational resources wisely to maximize outcome (or total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost/interruptible and high-cost/uninterruptible computation, by proposing a linearprogramming (LP) rounding based polynomial time algorithm. Based on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms, and are able to achieve the same utility with 23% to 48% of the budget needed with on-demand instances. Moreover, the total utility we get is close to the theoretical upper bound under various settings, indicating close to optimal performance.
更多
查看译文
关键词
Scheduling, Machine Learning Training, Spot Instances
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要