Towards performance portability of AI models using SYCL-DNN

Muhammad Tanvir, Kumudha Narasimhan, Mehdi Goli, Ouadie El Farouki, Svetlozar Georgiev, Isaac Ault

International Workshop on OpenCL (IWOCL), 2022

Abstract
The wide adoption of Deep Neural Networks (DNNs) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from edge devices to the cloud and supercomputers. This huge diversity soon becomes a burden due to the emerging dependencies between development stacks and deployment hardware. While ONNX, as a de facto standard for AI model description, provides portability of AI models across various AI frameworks, supporting DNN models on diverse hardware architectures remains challenging. Several existing AI frameworks, such as TensorFlow, PyTorch and ONNXRuntime, provide performance portability via dedicated backend implementations per hardware architecture. While this approach widens the range of supported hardware devices, it makes maintainability and readability challenging.

Many libraries and frameworks have been developed to support neural network models, and we discuss some of the important ones in this section. Frameworks like Glow [18], nGraph [14] and Tensor Comprehensions [19] use a compiler-based approach that accepts a neural network model and emits optimised code for specific hardware: the model is lowered into one or more intermediate representations before an optimised kernel is generated. These frameworks target a specific set of backends, and supporting any new hardware requires implementing a considerable fraction of the operators. Other frameworks like Caffe [16], PyTorch [17] and TinyNN [10] provide a runtime solution, integrating various vendor-specific libraries or graph compilers as backends to support neural network models on different architectures. Frameworks like TensorFlow [11] likewise rely on calling vendor-specific libraries or graph compilers. While embedding a vendor-specific library can achieve near-metal performance, it can make adding and maintaining different backends quite tedious.

Intel oneMKL [4] and oneDNN [7] are optimized libraries providing linear algebra subroutines and deep neural network routines for multi-core and many-core Intel systems. Recently, oneMKL and oneDNN have added support for running on Nvidia GPUs as well [15] via SYCL interoperability with third-party libraries. This approach integrates the existing vendor-optimised backend in SYCL to provide a single SYCL interface for memory management and runtime control from the user's point of view, while reusing the highly optimised vendor backend. The ARM Compute Library [1], cuBLAS [6] and cuDNN [13], and MIOpen [5] provide optimised routines for linear algebra and machine learning on ARM, Nvidia and AMD hardware respectively. All of these libraries are optimised for specific architectures and very rarely provide portability.

SYCL provides a C++-based portable parallel programming model targeting various devices such as CPUs, GPUs, DSPs and FPGAs. The SYCL programming model allows developers to write highly parametrized kernels for a diverse set of hardware in a unified setting; these kernels can then be tuned for the specified hardware. Hence, enabling a SYCL backend for an AI framework (e.g. TensorFlow or PyTorch) can lead to a hardware-agnostic model for heterogeneous systems while reusing existing optimized library implementations. Libraries like SYCL-BLAS [8] and SYCL-DNN [9] are open source and part of the SYCL ecosystem. They can be compiled with any SYCL compiler, such as ComputeCPP [2] or DPC++ [3], and run on any SYCL-enabled device; ComputeCPP also supports a SYCL RISC-V target [12]. This makes applications built on these libraries sufficiently portable.
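To make the idea of a single parametrized kernel concrete, below is a minimal SYCL sketch (our own illustration, not code taken from SYCL-DNN): the same kernel source is submitted to whatever device the runtime selects, and the work-group size is exposed as the kind of launch parameter that a per-device configuration would tune.

// Minimal SYCL 2020 sketch (illustration only, not SYCL-DNN source code).
// The same kernel runs unchanged on any SYCL device; the work-group size is
// a runtime parameter of the kind a per-device configuration would tune.
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 1 << 20;
  // Hypothetical tuning knob; a tuned library would read this from a
  // per-device configuration rather than hard-coding it.
  const size_t work_group_size = 128;

  sycl::queue q{sycl::default_selector_v};
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";

  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
  {
    sycl::buffer<float> buf_a(a.data(), sycl::range<1>(N));
    sycl::buffer<float> buf_b(b.data(), sycl::range<1>(N));
    sycl::buffer<float> buf_c(c.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler& cgh) {
      sycl::accessor in_a(buf_a, cgh, sycl::read_only);
      sycl::accessor in_b(buf_b, cgh, sycl::read_only);
      sycl::accessor out_c(buf_c, cgh, sycl::write_only, sycl::no_init);
      // The nd_range exposes the work-group size explicitly, so the launch
      // shape can be retuned per device without touching the kernel body.
      cgh.parallel_for(
          sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{work_group_size}},
          [=](sycl::nd_item<1> it) {
            const size_t i = it.get_global_id(0);
            out_c[i] = in_a[i] + in_b[i];
          });
    });
  }  // buffer destruction copies the result back to the host vectors
  std::cout << "c[0] = " << c[0] << "\n";  // expected: 3
  return 0;
}

The sketch uses SYCL 2020 style as accepted by DPC++; ComputeCPP, which implements SYCL 1.2.1, would use the cl::sycl namespace and the <CL/sycl.hpp> header instead, but the kernel body itself would be unchanged.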
The SYCL kernels implemented in SYCL-DNN and SYCL-BLAS allow tuning of parameters such as cache size, work-group size and local memory size according to the hardware we execute on. This helps reuse the existing kernels while still delivering good performance on new hardware by tuning these finer details. SYCL-DNN already supports an OpenCL backend, and in this paper we extend SYCL-DNN to support Nvidia and RISC-V architectures. Figure 1 shows the NN operation mapping. The results provide a detailed analysis of the performance portability of SYCL-based AI frameworks on various architectures with respect to state-of-the-art optimized vendor-specific libraries. On the existing OpenCL backends, the performance of SYCL-DNN is on par with that of device-specific optimized libraries.

We run the VGG model to understand and compare performance (Table 1). On the Intel GPU (HD Graphics 530), SYCL-DNN provides 80% of the performance of the optimized oneDNN execution provider; the gap in this case is due to the extra graph optimizations that oneDNN performs. For the Intel CPU, we used ComputeCPP 2.4 as the SYCL compiler and the latest oneDNN from its GitHub repository. We observe that SYCL-DNN performs 19% slower than oneDNN; however, tuning the matmul operation in SYCL-DNN provides a considerable speedup, after which SYCL-DNN performs 37% better than oneDNN.

We extend SYCL-DNN to support DPC++ as one of its SYCL compilers. DPC++ provides a CUDA backend, thereby enabling SYCL kernels to run on Nvidia devices, and we compare the performance with the optimized cuDNN library, using the latest DPC++ as the SYCL compiler and cuDNN version 7.6.5. Untuned, SYCL-DNN is almost 50% slower than cuDNN because its matmul implementation does not make use of local memory. Further tuning and using the optimized SYCL-BLAS implementation of matmul improves the performance, and SYCL-DNN comes within 90% of the performance of cuDNN (a sketch of such a local-memory tiled matmul is given at the end of this section). cuDNN has hand-written optimized implementations of some of the routines and hence achieves 10% more performance than SYCL-DNN, but code written for cuDNN cannot be reused on any other hardware.

Furthermore, there are currently no execution providers or frameworks that provide full support for RISC-V architectures. By integrating SYCL-DNN with the Acoran compute stack, which uses ComputeCPP and ComputeAorta to run SYCL kernels on RISC-V, we are able to generate RISC-V ISA. We run the VGG-16 model on the RISC-V Spike simulator. The current implementation of the simulator is single-core, and hence VGG-16 takes 312 seconds and ResNet-50 takes 198 seconds to complete execution; VGG-16 requires 16532358513 cycles to finish.

As future work, we are enabling a SYCL backend for ONNXRuntime in order to exploit the ONNX model loader to load ONNX models from different AI frameworks and to benefit from ONNXRuntime graph optimisation.
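As an illustration of the local-memory tuning discussed above for the matmul routine, the following is a minimal SYCL sketch (our own approximation for exposition, not the SYCL-BLAS or SYCL-DNN implementation). The tile size is the kind of compile-time parameter an auto-tuner would select per device; for brevity the kernel assumes the matrix dimensions are exact multiples of the tile size.

// Illustrative SYCL 2020 sketch of a local-memory tiled matrix multiply
// (an approximation for exposition; not the SYCL-BLAS/SYCL-DNN source).
// TILE is the kind of compile-time parameter an auto-tuner selects per device.
// Assumes M, N and K are exact multiples of TILE.
#include <sycl/sycl.hpp>

template <int TILE>
void matmul_tiled(sycl::queue& q, const float* A, const float* B, float* C,
                  size_t M, size_t N, size_t K) {
  sycl::buffer<float, 2> buf_a(A, sycl::range<2>(M, K));
  sycl::buffer<float, 2> buf_b(B, sycl::range<2>(K, N));
  sycl::buffer<float, 2> buf_c(C, sycl::range<2>(M, N));

  q.submit([&](sycl::handler& cgh) {
    sycl::accessor a(buf_a, cgh, sycl::read_only);
    sycl::accessor b(buf_b, cgh, sycl::read_only);
    sycl::accessor c(buf_c, cgh, sycl::write_only, sycl::no_init);
    // Work-group local (shared) memory tiles: staging operands here is the
    // optimisation the untuned matmul lacks on the CUDA backend.
    sycl::local_accessor<float, 2> tile_a(sycl::range<2>(TILE, TILE), cgh);
    sycl::local_accessor<float, 2> tile_b(sycl::range<2>(TILE, TILE), cgh);

    cgh.parallel_for(
        sycl::nd_range<2>{sycl::range<2>(M, N), sycl::range<2>(TILE, TILE)},
        [=](sycl::nd_item<2> it) {
          const size_t row = it.get_global_id(0);
          const size_t col = it.get_global_id(1);
          const size_t lr = it.get_local_id(0);
          const size_t lc = it.get_local_id(1);
          float acc = 0.0f;
          for (size_t t = 0; t < K; t += TILE) {
            // Each work-item loads one element of each operand tile.
            tile_a[lr][lc] = a[row][t + lc];
            tile_b[lr][lc] = b[t + lr][col];
            sycl::group_barrier(it.get_group());
            for (int k = 0; k < TILE; ++k)
              acc += tile_a[lr][k] * tile_b[k][lc];
            sycl::group_barrier(it.get_group());
          }
          c[row][col] = acc;
        });
  });
}  // buf_c writes the result back to C when it goes out of scope

A call such as matmul_tiled<16>(q, A, B, C, 256, 256, 256) runs the kernel with 16x16 tiles; retargeting another device then amounts to selecting a different TILE (and, in SYCL-DNN and SYCL-BLAS, other parameters such as work-group size and cache size) rather than rewriting the kernel. With DPC++ the same source can also be compiled for the CUDA backend (for example via its -fsycl-targets=nvptx64-nvidia-cuda option), whereas a hand-written cuDNN kernel cannot be retargeted in this way.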
Keywords
AI models, performance portability, SYCL-DNN