iCheck: Leveraging RDMA and Malleability for Application-Level Checkpointing in HPC Systems

Jophin John, Isaac David Núñez Araya, Michael Gerndt

2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS) (2023)

Abstract
The estimate that the mean time between failures in exascale supercomputers will be measured in minutes should alarm application developers. The system's inherent complexity, millions of components, and susceptibility to failures make checkpointing more relevant than ever. Since most high-performance scientific applications contain an in-house checkpoint-restart mechanism, their performance can suffer from contention for parallel file system resources. A shift in checkpointing strategies is needed to counter this behavior. With iCheck, we present a novel checkpointing framework that supports malleable, multilevel, application-level checkpointing. We employ an RDMA-enabled, configurable, multi-agent checkpoint transfer mechanism that uses minimal application resources for checkpointing. The high-level API of iCheck facilitates easy integration and malleability. We have integrated the iCheck library into the ls1 mardyn application, achieving a performance improvement of up to five thousand times over the in-house checkpointing mechanism. LULESH, a Jacobi 2D heat simulation, and a synthetic application were also used for extensive analysis.
Keywords
Fault Tolerance, Adaptive Checkpointing, RDMA, Malleable Checkpointing, MPI