Exploring reliability of exascale systems through simulations

Dongfang Zhao, Da Zhang, Ke Wang, Ioan Raicu

Proceedings of the High Performance Computing Symposium (2013)

Abstract

Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to maintain system reliability effectively and efficiently. Checkpointing is the state-of-the-art technique for high-end computing system reliability and has proved to work well at current petascale scales. This paper investigates the suitability of the checkpointing mechanism for exascale computers, across both parallel filesystems and distributed filesystems. We built a model to emulate exascale systems and developed a simulator, RXSim, to study their reliability and efficiency. Experiments show that overall system efficiency and availability would approach zero as system scale approaches exascale when checkpointing on parallel filesystems. However, the simulations suggest that a distributed filesystem with local persistent storage would offer excellent scalability and aggregate bandwidth, enabling efficient checkpointing at exascale.
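
The scaling argument in the abstract can be illustrated with a first-order estimate. The sketch below is not RXSim and does not reproduce the paper's model. It assumes Young's approximation for the checkpoint interval, independent node failures, and a restart cost equal to one checkpoint write. Every name and number in it (node MTBF, checkpoint size, bandwidths) is a hypothetical value chosen only for illustration.

```python
import math

# Hypothetical parameters, not taken from the paper.
NODE_MTBF_S = 5 * 365 * 24 * 3600.0   # assumed per-node MTBF: 5 years, in seconds
CKPT_GB_PER_NODE = 1.0                # assumed checkpoint size per node
SHARED_BW_GB_S = 1000.0               # assumed fixed aggregate bandwidth of a parallel filesystem
LOCAL_BW_GB_S = 0.5                   # assumed per-node bandwidth of local persistent storage


def efficiency(nodes, aggregate_bw_gb_s):
    """First-order fraction of time spent on useful work (clamped to [0, 1])."""
    system_mtbf = NODE_MTBF_S / nodes                      # failures assumed independent
    ckpt_time = nodes * CKPT_GB_PER_NODE / aggregate_bw_gb_s
    interval = math.sqrt(2.0 * ckpt_time * system_mtbf)    # Young's approximation
    # Waste = checkpoint overhead + expected rework and restart per failure.
    waste = ckpt_time / interval + (interval / 2.0 + ckpt_time) / system_mtbf
    return max(0.0, 1.0 - waste)


def parallel_fs_efficiency(nodes):
    # Shared filesystem: aggregate bandwidth stays fixed as nodes are added.
    return efficiency(nodes, SHARED_BW_GB_S)


def local_storage_efficiency(nodes):
    # Local storage: aggregate bandwidth grows linearly with node count.
    return efficiency(nodes, nodes * LOCAL_BW_GB_S)


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(parallel_fs_efficiency(n), 3), round(local_storage_efficiency(n), 3))
```

Under these assumptions the checkpoint time on the shared filesystem grows with node count while the system MTBF shrinks, so the estimated efficiency collapses at large scale; with local storage the checkpoint time stays constant and only the shrinking MTBF erodes efficiency.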

Keywords

checkpointing mechanism, parallel filesystems, Exascale computer, exascale computing, exascale system, high-end computing system reliability, overall system efficiency, system reliability, system scale, efficient checkpointing, Exploring reliability
