Fast Arbitrary Precision Floating Point on FPGA

2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022

Abstract
Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8× speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3× speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351× and 191× CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10× speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375× CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, published as an open source project.
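The recursively defined Karatsuba decomposition named in the abstract can be illustrated with a simplified software model (a Python sketch, not the paper's HLS C++ implementation): operands are split in half until they fit a native multiplier, with a 64-bit limb here standing in for an FPGA DSP block. Karatsuba replaces the four sub-multiplications of schoolbook decomposition with three, which is what yields the super-linear-but-subquadratic cost in the number of mantissa bits.

```python
def karatsuba(a: int, b: int, limb_bits: int = 64) -> int:
    """Recursive Karatsuba multiplication of non-negative integers.

    Illustrative model only: the base case stands in for a native
    (DSP-block) multiplication, and each recursion level would map to
    a pipeline stage in a hardware implementation.
    """
    if a < (1 << limb_bits) and b < (1 << limb_bits):
        return a * b  # base case: fits the native multiplier

    # Split both operands at half the width of the wider one.
    half = (max(a.bit_length(), b.bit_length()) + 1) // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    # Three recursive multiplications instead of four.
    z2 = karatsuba(a_hi, b_hi, limb_bits)
    z0 = karatsuba(a_lo, b_lo, limb_bits)
    z1 = karatsuba(a_hi + a_lo, b_hi + b_lo, limb_bits) - z2 - z0

    # Recombine: a*b = z2*2^(2*half) + z1*2^half + z0.
    return (z2 << (2 * half)) + (z1 << half) + z0
```

In hardware, the fixed, compile-time precision means this recursion can be fully unrolled into a static tree of DSP multiplications, which is what allows the deep pipelining described in the paper.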
Keywords
fast arbitrary precision floating point, numerical codes, arbitrary precision floating point numbers, core computation, elementary arithmetic operations, mantissa bits, APFP computations, conventional software-based architectures, native hardware support, elementary operations, machine-word-sized blocks, APFP multiplication, compile-time fixed-precision operands, deep FPGA pipelines, recursively defined Karatsuba decomposition, native DSP multiplication, Alveo U250 accelerator, dual-socket 36-core Xeon node, GNU Multiple Precision Floating-Point Reliable (MPFR), 4.8 GOp/s, 512-bit multiplication, 1024-bit multiplication, CPU cores, general matrix-matrix multiplication, 2.0 GOp/s, single FPGA, configurable HLS-based code, flexible HLS-based code