Particle track reconstruction on heterogeneous platforms with SYCL

IWOCL '23: Proceedings of the 2023 International Workshop on OpenCL (2023)

Abstract

With the SYCL programming model comes the promise of relatively easy development of parallel and accelerated code, as well as out-of-the-box portability between hardware platforms from different vendors. One area that can benefit greatly from these characteristics is particle physics experiments, where large amounts of data need to be processed in multiple stages by a wide variety of algorithms with different profiles. Such a data processing pipeline is often required to consume streaming data from the detectors in an online manner. Modern hardware platforms, accelerators, and their increasing performance give collaborations an opportunity to collect and analyze more data, more effectively and with better accuracy. On the other hand, building a complex software stack with a small team of developers becomes more and more challenging in a multi-vendor landscape in which new programming models and APIs keep emerging. As physics experiments are designed and computing solutions evaluated many years ahead of the actual run, the codebase of this kind of scientific software also needs to be future-proof, e.g., able to run on a next-generation computing cluster that uses GPU accelerators from different vendors or entirely different platforms such as upcoming powerful APU devices. In this project, we begin with a simple single-threaded implementation of a particle track reconstruction algorithm proposed for one of the subdetectors in the PANDA experiment, which is under development as part of the FAIR facility at GSI, Darmstadt, Germany. We start with the task of porting the algorithm to SYCL with minimal effort, i.e., trying to keep the kernel code as close to the original implementation as possible while attempting to maintain good parallelization and competitive performance in an accelerated environment.
After many iterations and experimentation with different memory layouts, as well as various approaches to expressing parallelism and data flow to tame the memory-bound character of the algorithm, we arrived at a final version that is still similar in code structure to the original implementation and achieves satisfactory performance across a wide range of targets. This final implementation, comprising 7 kernels and multiple auxiliary accelerated functions, was evaluated using two major SYCL implementations: hipSYCL and DPC++. Benchmarks were conducted on a wide variety of platforms from leading vendors, including NVIDIA V100, NVIDIA A100, and AMD MI250 GPUs, AMD EPYC Rome and Intel Cascade Lake CPUs, and an AMD/Xilinx Alveo U280 FPGA accelerator card. For the latter, an experimental AMD/Xilinx compiler based on Intel’s LLVM version was used. We also compare the performance with a CUDA implementation built in the same manner as the final SYCL one, showing that the SYCL version can achieve performance comparable to the native one. We show that developing performant and portable code from a truly single source for CPU and GPU is possible and accessible to developers with an intermediate understanding of parallelization and of how to interact effectively with GPU-based accelerators. Finally, for more exotic types of devices, such as FPGA-based accelerators, some host code modifications are required to compile and execute the software successfully. While the FPGA results are not competitive in terms of performance, we believe that the ability to run this kind of algorithm on an FPGA without significant adjustments is an achievement in itself.

Keywords

track, heterogeneous platforms, particle