Designing vector-friendly compact BLAS and LAPACK kernels

SC 2017

Abstract
Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization can provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchically parallel application. Our layout provides up to 14X, 45X, and 27X speedups over OpenMP loops around optimized dgemm, dtrsm, and dgetrf kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present a detailed performance analysis of our kernels.
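The core idea of the compact layout can be sketched in plain C. The sketch below is illustrative only: the names `VL`, `pack_compact`, and `compact_gemm` are hypothetical and not the paper's actual API, and `VL` is assumed to be the SIMD vector length (e.g. 4 doubles for AVX2, 8 for AVX-512). Element (i, j) of `VL` consecutive matrices is stored contiguously, so the innermost loop runs across matrices with unit stride and vectorizes even when each matrix is too small to vectorize on its own.

```c
#include <stdlib.h>  /* size_t */

#define VL 4  /* assumed SIMD vector length (4 doubles, e.g. AVX2) */

/* Pack 'count' column-major m-by-n matrices (src[b] points to the b-th
 * matrix) into the compact interleaved layout: the (i,j) entries of the
 * VL matrices in a block sit next to each other, so one SIMD load reads
 * the same (i,j) entry of VL different matrices. Assumes count % VL == 0. */
static void pack_compact(int m, int n, int count,
                         const double *const *src, double *dst)
{
    int nblocks = count / VL;
    for (int b = 0; b < nblocks; ++b)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                for (int v = 0; v < VL; ++v)
                    dst[((size_t)(b * n + j) * m + i) * VL + v] =
                        src[b * VL + v][(size_t)j * m + i];
}

/* Compact batched C = A * B for square m-by-m matrices. The lane loop
 * over v is innermost and unit-stride, so the compiler can vectorize
 * across matrices instead of within one small matrix. */
static void compact_gemm(int m, int count, const double *A,
                         const double *B, double *C)
{
    int nblocks = count / VL;
    for (int b = 0; b < nblocks; ++b) {
        const double *Ab = A + (size_t)b * m * m * VL;
        const double *Bb = B + (size_t)b * m * m * VL;
        double       *Cb = C + (size_t)b * m * m * VL;
        for (int j = 0; j < m; ++j)
            for (int i = 0; i < m; ++i) {
                double acc[VL] = {0.0};
                for (int k = 0; k < m; ++k)
                    for (int v = 0; v < VL; ++v)  /* vectorizable lane loop */
                        acc[v] += Ab[((size_t)k * m + i) * VL + v] *
                                  Bb[((size_t)j * m + k) * VL + v];
                for (int v = 0; v < VL; ++v)
                    Cb[((size_t)j * m + i) * VL + v] = acc[v];
            }
    }
}
```

In a real implementation the lane loop would typically be expressed with SIMD intrinsics or a vector type rather than left to the auto-vectorizer, but the data-layout idea is the same: the batch dimension, not the matrix dimensions, supplies the vector parallelism.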
Keywords
hierarchical parallel application, Intel® Math Kernel Library, vector-friendly compact BLAS, LAPACK kernels, PDE-based simulations, machine learning, BLAS APIs, meaningful speedup, problem type, non-canonical data layout enabling cross-matrix vectorization, significant speedup, compact data layout, interleaved matrices, SIMD vector length