Enabling prefix sum parallelism pattern for recurrences with principled function reconstruction

Proceedings of the 28th International Conference on Compiler Construction (2019)


Abstract

Much research over the last several decades has addressed the parallelization of loops with recurrences. Recently, the sampling-and-reconstruction method was proposed to parallelize a broad class of such loops automatically, using a practical runtime approach. Although the parallelized code achieves linear scalability on multicore architectures, the sequential merge inherent to this method prevents it from scaling on many-core architectures such as GPUs. At the same time, existing parallel merge approaches used for simple reduction loops cannot be applied to this method directly and correctly. Based on this observation, we propose new methods that merge partial results in parallel on GPUs and achieve linear scalability. Our approach refines the runtime-checking rules to avoid unnecessary runtime check failures and to reduce the overhead of reprocessing. We also propose a sample-converge technique that reduces the number of sample points, and with it the communication and computation overhead. Finally, we develop optimization techniques based on GPU architectural features to further improve performance. Our evaluation on a set of representative algorithms shows that our parallel merge implementation is substantially more efficient than sequential merge and achieves linear scalability on different GPUs.
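
The idea of merging partial results with a prefix sum can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes a linear recurrence x[i] = a[i] * x[i-1] + b[i], it assumes each chunk behaves as an affine function of its incoming value (so two sample inputs, 0 and 1, are enough to reconstruct it), and it omits the runtime checks and the GPU mapping entirely. All function names are hypothetical.

```python
from itertools import accumulate


def run_chunk(a, b, x0):
    """Sequentially apply x = a[i] * x + b[i] over one chunk, starting from x0."""
    x = x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
    return x


def reconstruct_affine(a, b):
    """Sample the chunk at x0 = 0 and x0 = 1 and recover (m, c) with f(x) = m * x + c.

    Assumption: the chunk really is affine in its input, which holds for this
    recurrence. A general method would need a runtime check here.
    """
    c = run_chunk(a, b, 0)
    m = run_chunk(a, b, 1) - c
    return (m, c)


def compose(f, g):
    """Return the affine pair for x -> g(f(x)), i.e. apply f first, then g."""
    return (g[0] * f[0], g[0] * f[1] + g[1])


def chunk_end_values(a, b, x0, chunk_size):
    """Return the recurrence value at the end of every chunk."""
    chunks = [
        (a[i:i + chunk_size], b[i:i + chunk_size])
        for i in range(0, len(a), chunk_size)
    ]
    # Each chunk is reconstructed independently, so this step is parallel.
    funcs = [reconstruct_affine(ca, cb) for ca, cb in chunks]
    # Inclusive scan over the chunk functions. Composition is associative,
    # so a parallel prefix sum could replace this sequential accumulate.
    prefixes = accumulate(funcs, compose)
    return [m * x0 + c for m, c in prefixes]
```

Calling chunk_end_values(a, b, x0, chunk_size) yields the same chunk-boundary values as running the loop sequentially. Because the merge step is a scan over an associative operator rather than a left-to-right fold, it is the part that can, in principle, be mapped onto a parallel prefix-sum primitive.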


Keywords

GPU optimizations, communication, parallel merge, prefix-sum, runtime check, sample-and-reconstruction
