Understanding the Propagation of Error Due to a Silent Data Corruption in a Sparse Matrix Vector Multiply

Cluster Computing (2015)

Abstract
With the rate of errors that silently affect an application's state or output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work has investigated why these schemes detect some errors while others are masked. This paper investigates how silent data corruption (SDC) propagates through a sparse matrix-vector multiply (SpMV), a fundamental HPC computation kernel. We find that analyzing the mathematics of the SpMV alone gives a limited understanding of SDC propagation. We achieve a more complete understanding by investigating how SDC propagates in an SpMV as it is expressed in machine instructions.
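As a rough illustration of the data-level view the abstract mentions, the sketch below injects a single bit flip into a small SpMV and reports which output rows change. It is not the paper's methodology: the CSR layout, the 3x3 matrix, and the helper names `flip_bit`, `spmv_csr`, and `affected_rows` are all assumptions made for this example, and it does not model corruption at the machine-instruction level (for instance in index or address arithmetic).

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 double representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def affected_rows(reference, corrupted):
    """Indices of output entries that differ from the fault-free result."""
    return [i for i, (a, b) in enumerate(zip(reference, corrupted)) if a != b]


# Hypothetical 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]

clean = spmv_csr(row_ptr, col_idx, vals, x)

# Case 1: flip the lowest mantissa bit of one matrix value (the 3.0 in row 1).
vals_bad = list(vals)
vals_bad[2] = flip_bit(vals[2], 0)
y_val = spmv_csr(row_ptr, col_idx, vals_bad, x)

# Case 2: flip the lowest exponent bit of x[0], turning 1.0 into 0.5.
x_bad = list(x)
x_bad[0] = flip_bit(x[0], 52)
y_vec = spmv_csr(row_ptr, col_idx, vals, x_bad)

print("clean result:          ", clean)
print("rows hit by value flip:", affected_rows(clean, y_val))
print("rows hit by x[0] flip: ", affected_rows(clean, y_vec))
```

In this sketch a corrupted matrix value disturbs only the row that owns it, while a corrupted vector entry disturbs every row with a nonzero in that column. That is the kind of conclusion a purely mathematical analysis yields; per the abstract, it is an incomplete picture of how SDC propagates.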
Keywords
Silent Data Corruption, Error Propagation