Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

ACM International Conference on Mobile Computing and Networking (2022)

Abstract
Mobile GPUs, as ubiquitous and powerful accelerators, play an important role in accelerating on-device DNN (Deep Neural Network) inference. The frequent upgrades and diversity of mobile GPUs require automatic kernel generation to enable fast DNN deployment. However, currently generated kernels have poor performance. The goal of this paper is to rapidly generate high-performance kernels for diverse mobile GPUs. The major challenges are (1) it is unclear what the optimal kernel is, due to the lack of hardware knowledge, and (2) how to rapidly generate it from a large space of candidates. For the first challenge, we propose a cross-platform profiling tool, the first to disclose and quantify mobile GPU architecture. The results demystify the hardware bottlenecks and also direct the solution to the second challenge by exposing unique hardware features, identifying kernels that are inefficient against hardware constraints, and specifying performance bounds for kernels. Guided by these findings, we propose Romou, a mobile-GPU-specific kernel compiler. It supports the unique hardware features in kernel implementation and prunes kernels that are inefficient against hardware resources. Romou can thus rapidly generate high-performance kernels. Compared to state-of-the-art generated kernels, it achieves up to a 14.7x speedup on average for convolution, and up to 99% of the search space is pruned. Its performance is even up to 1.2x faster on average than state-of-the-art hand-optimized implementations.
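The abstract describes pruning candidate kernels that violate hardware resource constraints discovered by profiling. The sketch below illustrates that general idea only: candidate configurations whose estimated register or local-memory footprint exceeds the profiled device limits are rejected before compilation and measurement. All names, cost estimates, and limit values here are hypothetical, not Romou's actual API or heuristics.

```python
# Minimal sketch of resource-based candidate pruning (illustrative only).
from dataclasses import dataclass
from itertools import product


@dataclass
class DeviceLimits:
    # Assumed limits obtained from a profiling step (illustrative values).
    max_registers_per_thread: int = 64
    max_local_memory_bytes: int = 16 * 1024
    max_workgroup_size: int = 256


@dataclass
class KernelConfig:
    tile_m: int
    tile_n: int
    workgroup_size: int

    def estimated_registers(self) -> int:
        # Rough estimate: one accumulator register per output element per thread.
        return self.tile_m * self.tile_n

    def estimated_local_memory(self) -> int:
        # Rough estimate of local memory for staging fp32 input tiles.
        return 4 * self.workgroup_size * (self.tile_m + self.tile_n)


def prune_candidates(candidates, limits: DeviceLimits):
    """Keep only configurations that fit within the profiled hardware limits."""
    kept = []
    for cfg in candidates:
        if cfg.estimated_registers() > limits.max_registers_per_thread:
            continue  # would likely spill registers: reject
        if cfg.estimated_local_memory() > limits.max_local_memory_bytes:
            continue  # exceeds local memory capacity: reject
        if cfg.workgroup_size > limits.max_workgroup_size:
            continue  # unsupported launch configuration: reject
        kept.append(cfg)
    return kept


if __name__ == "__main__":
    space = [KernelConfig(m, n, wg)
             for m, n, wg in product([1, 2, 4, 8], [1, 2, 4, 8], [64, 128, 256, 512])]
    survivors = prune_candidates(space, DeviceLimits())
    print(f"pruned {len(space) - len(survivors)} of {len(space)} candidates")
```

In this toy setup, most of the candidate space is discarded by cheap static checks before any kernel is compiled or run, which is the mechanism the paper credits for its search-space reduction.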
Keywords
Mobile GPU, Automatic Code Generation, Architecture Profiling, Deep Neural Networks