Latency-Based Inter-Operator Scheduling for CNN Inference Acceleration on GPU

Yukai Ping, He Jiang, Xingxiang Liu, Zhenyang Zhao, Zhide Zhou, Xin Chen

IEEE Trans. Serv. Comput. (2024)

Abstract
Convolutional Neural Networks (CNNs) are widely deployed on Graphics Processing Units (GPUs) to support Deep Learning (DL) based services. Popular DL frameworks usually ignore inter-operator parallelism when executing CNN inference, which results in high inference latency. Although some inter-operator scheduling methods have been proposed, a critical trade-off remains between inference latency (effectiveness) and scheduling time (efficiency). In this paper, we propose LIOS, a novel latency-based heuristic inter-operator scheduling method that balances inference latency and scheduling time. In LIOS, a CNN latency model is built for the given CNN and GPU. Then every operator is assigned a priority value representing its importance. During each iteration of the scheduling process, LIOS identifies the currently data-independent operators, selects the one with the highest priority value, and assigns it to the GPU stream with the smallest finish time. Extensive experimental results demonstrate the effectiveness and efficiency of LIOS. In terms of effectiveness, LIOS speeds up the inference of normal-size and large-size CNNs by $1.13\sim 1.59\times$ compared to sequential scheduling, a result comparable to IOS, the latest state-of-the-art scheduling method. In terms of efficiency, LIOS speeds up the scheduling process by $7\sim 9210\times$ compared to IOS.
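The scheduling loop described in the abstract (gather data-independent operators, pick the highest-priority one, place it on the stream with the smallest finish time) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the priority function here is the classic "bottom-level" longest-path heuristic over per-operator latencies, an assumption on our part, since LIOS derives its priorities from its own CNN latency model.

```python
# Hedged sketch of a LIOS-style greedy inter-operator scheduler.
# Assumptions (not from the paper): priorities are bottom-level longest
# paths, and per-operator latencies are given as a plain dict.

def bottom_level(op, succs, latency, memo=None):
    """Longest total latency from `op` to any exit operator (inclusive)."""
    if memo is None:
        memo = {}
    if op not in memo:
        memo[op] = latency[op] + max(
            (bottom_level(s, succs, latency, memo) for s in succs[op]),
            default=0.0,
        )
    return memo[op]

def schedule(ops, preds, succs, latency, num_streams=2):
    """Greedy list scheduling of a CNN operator DAG across GPU streams.

    Each iteration: collect the data-independent (ready) operators,
    take the one with the highest priority, and assign it to the
    stream whose current finish time is smallest.
    Returns (op -> stream assignment, overall makespan).
    """
    memo = {}
    priority = {op: bottom_level(op, succs, latency, memo) for op in ops}
    stream_finish = [0.0] * num_streams
    op_finish = {}
    assignment = {}
    done = set()
    while len(done) < len(ops):
        # Operators whose predecessors have all finished are ready.
        ready = [o for o in ops
                 if o not in done and all(p in done for p in preds[o])]
        op = max(ready, key=lambda o: priority[o])
        s = min(range(num_streams), key=lambda i: stream_finish[i])
        # Start only after the stream is free AND every predecessor
        # (possibly on another stream) has finished.
        start = max([stream_finish[s]] + [op_finish[p] for p in preds[op]])
        stream_finish[s] = start + latency[op]
        op_finish[op] = stream_finish[s]
        assignment[op] = s
        done.add(op)
    return assignment, max(stream_finish)
```

On a diamond-shaped DAG (a → {b, c} → d) with latencies 1, 2, 3, 1, sequential execution takes 7 time units, while this two-stream schedule overlaps b and c and finishes in 5, illustrating the kind of inter-operator speedup the paper measures.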
Keywords
Convolutional neural network, deep learning, inference acceleration, inter-operator scheduling