623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores.

IJHPCA (2016)

Abstract
In this article, we present a new hybrid algorithm to enable and scale the High Performance Conjugate Gradients (HPCG) benchmark on large-scale heterogeneous systems such as the Tianhe-2. Based on an inner-outer subdomain partitioning strategy, the data distribution between host and device can be balanced adaptively. The overhead of data movement from both MPI communication and PCI-E transfer can be significantly reduced by carefully rearranging and fusing operations. A variety of parallelization and optimization techniques for performance-critical kernels are exploited and analyzed to maximize the performance gain on both host and device. We carry out experiments on both a small heterogeneous computer and the world's largest one, the Tianhe-2. On the small system, a thorough comparison and analysis is presented to select among different optimization choices. On Tianhe-2, the optimized implementation scales to the full-system level of 3.12 million heterogeneous cores, with an aggregated performance of 623 Tflop/s and a parallel efficiency of 81.2%.
Keywords
Tianhe-2, HPCG, conjugate gradients, MIC, heterogeneous computing