Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems

Jidong Zhai,Liyan Zheng,Jinghan Sun,Feng Zhang,Xiongchao Tang,Xuehai Qian,Bingsheng He,Wei Xue,Wenguang Chen,Weimin Zheng

IEEE Transactions on Parallel and Distributed Systems（2022）

引用 1|浏览78

暂无评分

摘要

Variations in the performance of parallel and distributed systems are becoming increasingly challenging. The runtimes of different executions can vary greatly even with a fixed number of computing nodes. Many HPC applications on supercomputers exhibit such variance. This not only leads to unpredictable execution times, but also renders the system’s behavior unintuitive. The efficient online detection of variations in performance is an open problem in HPC research. To solve it, we propose an approach, called vSensor , to detect variations in the performance of systems. The key finding of this study is that the source code of programs can better represent performance at runtime than an external detector. Specifically, many HPC applications contain code snippets that are fixed workload patterns of execution, e.g., the workload of an invariant quantity and a linearly growing workload. This observation allows us to automatically identify these snippets of workload-related code and use them to detect variations in performance. We evaluate vSensor on the Tianhe-2A system with a large number of parallel applications, and the results indicate that it can efficiently identify variations in system performance. The average overhead of 4,096 processes is less than 6% for fixed-workload v-sensors. We identify a problematic node with slow memory by using vSensor that degrades the performance of the program by 21%. A serious issue with network performance is also detected that slows down the Tianhe-2A system by 3.37 times for an HPC kernel.

查看译文

关键词

Parallel computing,performance variance,HPC,performance analysis,performance optimization

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要