Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters.

IEEE Transactions on Dependable and Secure Computing (2018)

Abstract
Availability of the interconnection network in high-performance computing (HPC) systems is fundamental to sustaining the continuous execution of applications at scale. When failures occur, interconnect recovery mechanisms orchestrate complex operations to recover network connectivity between the nodes. As the scale and design complexity of HPC systems increase, so does the system's susceptibility ...
Keywords
Data security, Network security, Fault tolerance, Fault diagnosis, Multiprocessor interconnection, Data analysis