Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition
CoRR (2024)
Abstract
We propose an implementation of an efficient fused matrix multiplication
kernel for W4A16 quantized inference, where we perform dequantization and GEMM
in a fused kernel using a SplitK work decomposition. Our implementation shows
improvement for the type of skinny matrix-matrix multiplications found in
foundation model inference workloads. In particular, this paper surveys the
type of matrix multiplication between a skinny activation matrix and a square
weight matrix. Our results show average speed improvements of 65% and 124% for a
range of matrix dimensions, including those found in a Llama-style model, where
m < n = k.
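The abstract combines two ideas: dequantizing 4-bit weights inside the GEMM kernel (W4A16), and splitting the reduction dimension K across parallel workers that accumulate partial sums (SplitK). The following NumPy sketch illustrates the arithmetic of both, not the paper's actual Triton kernel; the shapes, group size, and number of K-splits are illustrative assumptions.

```python
import numpy as np

# Sketch of fused-dequant GEMM with a SplitK-style decomposition.
# All parameters below are assumptions for illustration, not the paper's.
m, n, k = 4, 64, 64          # skinny activation (small m), square weight (n = k)
group_size = 16              # quantization group size along k (assumed)
SPLIT_K = 4                  # number of K-slices (assumed)

rng = np.random.default_rng(0)
x = rng.standard_normal((m, k)).astype(np.float16)       # fp16 activations (A16)
q = rng.integers(0, 16, size=(k, n), dtype=np.uint8)     # 4-bit weight codes (W4)
scales = rng.uniform(0.01, 0.1, size=(k // group_size, n)).astype(np.float16)

def dequant(q_chunk, s_chunk):
    # Map 4-bit codes [0, 15] to signed values, then apply per-group scales.
    w = q_chunk.astype(np.float32) - 8.0
    return w * np.repeat(s_chunk.astype(np.float32), group_size, axis=0)

# SplitK: each slice dequantizes only its K-chunk and computes a partial GEMM;
# the partials are summed into one fp32 accumulator (the real kernel would use
# atomic adds across thread blocks).
out = np.zeros((m, n), dtype=np.float32)
k_chunk = k // SPLIT_K
for s in range(SPLIT_K):
    lo, hi = s * k_chunk, (s + 1) * k_chunk
    w_chunk = dequant(q[lo:hi], scales[lo // group_size:hi // group_size])
    out += x[:, lo:hi].astype(np.float32) @ w_chunk      # partial product

# Reference: dequantize the whole weight matrix, then one unsplit GEMM.
ref = x.astype(np.float32) @ dequant(q, scales)
assert np.allclose(out, ref, atol=1e-3)
```

SplitK helps precisely in the skinny-GEMM regime the abstract describes: when m is small, a classic tile-per-output decomposition launches too few thread blocks to saturate the GPU, so splitting K recovers parallelism at the cost of a final reduction.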