Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation

ETS (2023)

Abstract
Graphics Processing Units (GPUs) are essential in High Performance Computing (HPC) and in safety-critical applications such as autonomous vehicles. This market shift has led to significant improvements in programming frameworks and evaluation tools, and has raised concerns about GPU reliability. However, the high complexity of GPUs makes their reliability challenging to evaluate. We conducted the first cross-layer GPU reliability evaluation to unveil and mitigate GPU vulnerabilities. The evaluation compares and combines extensive neutron beam experiments, fault simulation campaigns, and application profiling. Based on this detailed analysis, we propose a novel methodology to accurately estimate the Failure In Time (FIT) rate of GPU applications. The cross-layer evaluation enables two novel hardening solutions: (1) Reduced Precision Duplication With Comparison (RP-DWC), which executes a redundant copy in reduced precision and delivers fault coverage of up to 86% with low execution time and energy consumption overheads (13% and 24%, respectively); (2) dedicated software solutions for hardening Convolutional Neural Networks (CNNs), which can detect up to 98% of errors.
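The idea behind RP-DWC, a redundant copy run in reduced precision and compared against the full-precision result, can be illustrated with a minimal CPU-side sketch. This is not the authors' implementation: the dot-product kernel, the function names (to_f32, dot_full, dot_reduced, rp_dwc_agrees, flip_bit), the binary64/binary32 precision pair, and the relative tolerance of 1e-4 are all assumptions chosen for illustration.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_full(a, b):
    """Primary copy: dot product accumulated in full (binary64) precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_reduced(a, b):
    """Redundant copy: the same dot product, rounded to binary32 at each step."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_f32(to_f32(x) * to_f32(y))
        acc = to_f32(acc + prod)
    return acc

def rp_dwc_agrees(full: float, reduced: float, rel_tol: float = 1e-4) -> bool:
    """Compare the two copies; rel_tol is a hypothetical threshold that
    absorbs the expected rounding gap between the two precisions."""
    scale = max(abs(full), abs(reduced), 1.0)
    return abs(full - reduced) <= rel_tol * scale

def flip_bit(x: float, bit: int) -> float:
    """Emulate a single-bit upset in the binary64 encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

a = [0.5, 1.25, -2.0, 3.0]
b = [4.0, 0.1, 0.25, 1.5]

full = dot_full(a, b)
reduced = dot_reduced(a, b)

fault_free_ok = rp_dwc_agrees(full, reduced)               # no fault injected
exp_flip_ok = rp_dwc_agrees(flip_bit(full, 61), reduced)   # exponent-bit upset
lsb_flip_ok = rp_dwc_agrees(flip_bit(full, 0), reduced)    # lowest mantissa bit

print(fault_free_ok, exp_flip_ok, lsb_flip_ok)
```

In this sketch the exponent-bit upset is flagged by the comparison, while the flip in the least significant mantissa bit falls inside the tolerance and goes undetected. A comparison across precisions must tolerate some mismatch, so its coverage depends on the chosen threshold; how the paper sets that threshold is not stated in the abstract.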
Keywords
GPU,reliability,fault tolerance,neutron induced errors,radiation experiments,machine learning,HPC