Systematically extending a high-level code generator with support for tensor cores.

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPoPP), 2022

Abstract
High-level code generators like Halide, Lift, and RISE make a compelling proposition: write programs in a simple high-level language and get high-performing GPU code "for free". They achieve this feat by restricting the input language to a specific domain (such as image and array processing in Halide) or to a fixed set of flexible parallel patterns (as Lift and RISE do). Implementing high-level code generators that produce high-performance code is challenging, especially as the target hardware constantly evolves. In this paper, we discuss how we systematically extend the RISE high-level code generator with support for tensor cores, a specialized hardware feature of recent Nvidia GPUs. We highlight the design of RISE that makes it easily extensible by following a systematic bottom-up approach that first exposes the imperative tensor core API to the code generator, then raises the abstractions to an internal low-level functional representation, and finally targets that representation with a rewrite process starting from a high-level functional program. Our experimental evaluation shows that RISE with support for tensor cores generates code with performance competitive with manually optimized CUDA code: it is at most 36%, and on average only 10%, slower than Nvidia's highly optimized cuBLAS library, and it clearly outperforms any code that does not exploit tensor cores.
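For context, the "imperative tensor core API" that the bottom-up approach starts from is CUDA's warp-level WMMA interface (`nvcuda::wmma`). The sketch below is not taken from the paper; it is a minimal illustration of that API, multiplying one 16x16x16 half-precision tile into a float accumulator, which is the kind of low-level building block a code generator must first expose before raising it to a functional abstraction. The kernel name and the fixed single-tile shape are illustrative assumptions.

```cuda
#include <mma.h>
using namespace nvcuda;

// Illustrative kernel: one warp computes C = A * B + C for a single
// 16x16x16 tile using tensor cores via the WMMA API. Matrices are
// assumed densely packed with leading dimension 16.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c) {
    // Declare per-warp fragments for the A, B, and accumulator tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);      // zero the accumulator
    wmma::load_matrix_sync(aFrag, a, 16);  // cooperative warp-wide load
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // tensor core MMA
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}
```

Each `*_sync` call is executed collectively by all 32 threads of a warp; it is this collective, fragment-based programming model that makes the API awkward to target directly and motivates raising it into a functional representation that rewrite rules can manipulate.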