GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

The International Conference for High Performance Computing, Networking, Storage, and Analysis (2020)

Abstract
The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint because most of its power came from 18,688 GPUs, which had to undergo three rework cycles: two on the GPU mechanical assembly and one on the GPU circuit boards. We describe the last rework cycle and present a reliability analysis of over 100,000 years of GPU lifetimes during Titan's 6-year-long productive period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability depends on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis, and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.
Keywords
GPU, reliability, supercomputer, NVIDIA, Cray, large-scale systems, log analysis, MTBF, Kaplan-Meier survival, Cox regression, GPU failure data set
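
The abstract and keywords name Kaplan-Meier survival estimation as one of the techniques applied to the GPU lifetimes. As a rough illustration of what that estimator computes, here is a minimal sketch in plain Python. It is not the authors' analysis code; the function name `kaplan_meier` and the toy durations are hypothetical and are not taken from the Titan data set. Units that were still working when observation stopped are treated as censored, and a unit censored at time t is assumed to still be at risk at t.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: time in service for each unit (any consistent time unit)
    observed:  True if the unit failed at that time, False if it was
               censored (still working when observation stopped)
    """
    failures = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # failures and censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk units that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data (years in service), NOT from the Titan data set.
durations = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]
observed = [True, True, False, True, False, False]
for t, s in kaplan_meier(durations, observed):
    print(f"t={t:.1f}  S(t)={s:.3f}")
```

The survival curve only steps down at failure times; censored units shrink the risk set without counting as failures, which is what lets lifetimes of still-working units contribute to the estimate.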