Cambricon-P: A Bitflow Architecture for Arbitrary Precision Computing

Yifan Hao,Yongwei Zhao, Chenxiao Liu,Zidong Du, Shuyao Cheng,Xiaqing Li,Xing Hu,Qi Guo,Zhiwei Xu,Tianshi Chen

2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO)（2022）

引用 2|浏览11

暂无评分

摘要

Arbitrary precision computing (APC), where the digits vary from tens to millions of bits, is fundamental for scientific applications, such as mathematics, physics, chemistry, and biology. APC on existing platforms (e.g., CPUs and GPUs) is achieved by decomposing the original data into small pieces to accommodate to the low-bitwidth (e.g., 32-/64-bit) functional units. However, such fine-grained decomposition inevitably introduces large amounts of intermediates, bringing in intensive on-chip data traffic and long, complex dependency chains, so that causing low hardware utilization.To address this issue, we propose Cambricon-P, a bitflow architecture supporting monolithic large and flexible bitwidth operations for efficient APC processing, which avoids generating large amounts of intermediates from decomposition. Cambricon- P features a tightly-integrated computational architecture for processing different bitflows in parallel, where full bit-serial data paths are deployed. The bit-serial scheme still needs to eliminate the dependency chain of APC for exploiting parallelism within one monolithic large-bitwidth operation. For this purpose, Cambricon-P adopts a carry parallel computing mechanism, which enables recursively transforming the multiplication into smaller inner-products that can be performed in parallel between bit-indexed IPUs (Inner-Product Units). Furthermore, to improve the computing efficiency of APC, Cambricon- P employs a bit-indexed inner-product processing scheme, namely BIPS, to eliminate intra-IPU bit-level redundancy. Compared to Intel Xeon 6134 CPU, Cambricon-P achieves 100.98$\times$ performance on monolithic long multiplication, and 23.41$\times$/30.16$\times$ speedup and energy benefit over four real-world APC applications on average. Compared to NVidia V100 GPU, Cambricon-P also delivers the same throughput, as well as 430$\times$/60.5$\times$ lesser area and power, respectively, on batch-processing multiplications.

查看译文

关键词

arbitrary precision,bitflow architecture,bit-serial

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要