MIC-Check: A Distributed Checkpointing Framework for the Intel Many Integrated Cores Architecture

HPDC (2014)

Citations: 11 | Views: 45
Abstract
The advent of many-core architectures like Intel MIC is enabling the design of increasingly capable supercomputers within reasonable power budgets. Fault tolerance is becoming more important as the number of components and the complexity of these heterogeneous clusters increase. Checkpoint-restart mechanisms have traditionally been used to enhance the dependability of applications and to enable dynamic task rescheduling in the face of system failures. Naive checkpointing protocols, which are predominantly I/O-intensive, face severe performance bottlenecks on the Xeon Phi architecture due to several inherent and acquired limitations. Consequently, existing checkpointing frameworks are not capable of serving distributed MPI applications that leverage heterogeneous hardware architectures. This paper discusses the I/O limitations of the Xeon Phi system and describes the architecture and design of MIC-Check, a novel distributed checkpointing framework for HPC applications running on it.