Reliability Lessons Learned from GPU Experience with the Titan Supercomputer at Oak Ridge Leadership Computing Facility

Devesh Tiwari, Saurabh Gupta, George Gallarno, Jim Rogers, Don Maxwell

SC (2015)

Abstract
The high computational capability of graphics processing units (GPUs) is enabling and driving the scientific discovery process at large scale. The world's second fastest supercomputer for open science, Titan, has more than 18,000 GPUs that computational scientists use to perform scientific simulations and data analysis. Understanding of GPU reliability characteristics, however, is still in its nascent stage since GPUs have only recently been deployed at large scale. This paper presents a detailed study of GPU errors and their impact on system operations and applications, describing experiences with the 18,688 GPUs on the Titan supercomputer as well as lessons learned in the process of efficient operation of GPUs at scale. These experiences are helpful to HPC sites which already have large-scale GPU clusters or plan to deploy GPUs in the future.
Keywords
reliability lessons, Titan supercomputer, Oak Ridge Leadership Computing Facility, GPU computational capability, graphics processing units, scientific simulations, data analysis, GPU errors, system operations, GPU system failures