From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell

Cong Li, Yu Zhang, Jialei Wang, Hang Chen,Xian Liu,Tai Huang, Liang Peng,Shen Zhou, Lixin Wang,Shijian Ge

SC22: International Conference for High Performance Computing, Networking, Storage and Analysis(2022)

引用 5|浏览3
暂无评分
摘要
Uncorrectable memory errors are one of the major failure causes in datacenters. In this paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable errors (UEs) using the large-scale field data across 3 major dual in-line memory module (DIMM) manufacturers from a contemporary server farm of ByteDance. Different from the previous studies, our study is the first to comprehend the error-bit information of CEs and the DIMM part numbers. Unlike the traditional chipkill error correction code (ECC), in contemporary Intel server platforms the ECC gets weakened, not able to tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: risky CE occurrence in terms of ECC guaranteed coverage. From the data, we show that the new indicator has a consistently high sensitivity and specificity in the test of future UE occurrences across DIMMs from different manufacturers. This leads us to conjecture that the weakened ECC substantially contributes to many UEs today. The new risky CE indicator is then applied in predicting the future UE occurrence based on the CE history. We empirically demonstrate how practically useful predictors are constructed in conjunction with other useful attributes such as certain micro-level fault indicators and DIMM part numbers, achieving the state-of-the-art performance.
更多
查看译文
关键词
Memory reliability,error correction code,risky errors,uncorrectable error prediction
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要