Multiplying 2 x 2 Sub-Blocks Using 4 Multiplications

Yoav Moran, Oded Schwartz

Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2023), 2023

Abstract
Fast parallel and sequential matrix multiplication algorithms switch to the cubic-time classical algorithm on small sub-blocks, as the classical algorithm requires fewer operations on small blocks. We obtain a new algorithm that can outperform the classical one even on small blocks by trading multiplications for additions. This algorithm contradicts the common belief that the classical algorithm is the fastest algorithm for small blocks. To this end, we introduce commutative algorithms that generalize Winograd's folding technique (1968) and combine it with fast matrix multiplication algorithms. Thus, when a single scalar multiplication requires rho times more clock cycles than an addition (e.g., for 16-bit integers on Intel's Skylake microarchitecture, rho is between 1.5 and 5), our technique reduces the computation cost of multiplying the small sub-blocks by a factor of (rho+3)/(2(rho+1)) compared to using the classical algorithm, at the price of a low-order communication cost overhead in both the sequential and the parallel cases, thus reducing the total runtime of the algorithm. Our technique also reduces the energy cost of the algorithm; the rho values for energy costs are typically larger than the rho values for arithmetic costs. For example, we obtain an algorithm for multiplying 2 x 2 blocks using only four multiplications. This algorithm seemingly contradicts the lower bound of Winograd (1971) on multiplying 2 x 2 matrices; however, we obtain it by bypassing the implicit assumptions of that lower bound. We provide a new lower bound matching our algorithm for 2 x 2 block multiplication, thus showing our technique is optimal.
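To illustrate the kind of multiplication-for-addition trade the abstract describes, the following sketch shows the classical Winograd (1968) folding technique that the paper's commutative algorithms generalize (this is the well-known 1968 construction, not the paper's new four-multiplication 2 x 2 algorithm). It computes an inner product of even-length vectors using n/2 multiplications, plus two per-vector precomputations (xi and eta) whose cost is amortized across the many inner products of a matrix product. The identity relies on commutativity of scalar multiplication, which is why it does not apply recursively to block (matrix) entries.

```python
def winograd_inner(x, y):
    """Inner product of two even-length vectors via Winograd's 1968 folding.

    Uses n/2 multiplications in the folded sum; xi and eta depend on only
    one vector each, so in an m x n by n x p matrix product they are
    computed once per row/column and amortized.
    """
    n = len(x)
    assert n == len(y) and n % 2 == 0, "vectors must have equal even length"
    # Per-vector precomputations (amortized in a full matrix product).
    xi = sum(x[2 * j] * x[2 * j + 1] for j in range(n // 2))
    eta = sum(y[2 * j] * y[2 * j + 1] for j in range(n // 2))
    # Folded sum: n/2 multiplications, each covering two product terms.
    folded = sum((x[2 * j] + y[2 * j + 1]) * (x[2 * j + 1] + y[2 * j])
                 for j in range(n // 2))
    # Expanding each factor pair yields x.y + xi + eta, so subtract.
    return folded - xi - eta
```

For example, `winograd_inner([1, 2, 3, 4], [5, 6, 7, 8])` returns 70, matching the direct inner product 1*5 + 2*6 + 3*7 + 4*8.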