CHOPPER: A Compiler Infrastructure for Programmable Bit-serial SIMD Processing Using Memory in DRAM.

HPCA(2023)

Abstract
Growing interest in Bit-serial SIMD Processing-Using-DRAM (PUD) architectures, driven by their ultra-wide SIMD width and reduced data movement, amplifies the need for a compiler that automates code generation. State-of-the-art Bit-serial SIMD PUD architectures (1) provide only assembly-level SIMD programming interfaces, which places a heavy burden on programmers who want to exploit the ultra-wide SIMD width of these architectures; and (2) encapsulate 1-bit operations in multi-bit abstractions, which creates a granularity mismatch and restricts the optimization space for minimizing data movement. We present CHOPPER, a new compiler infrastructure that makes Bit-serial SIMD PUD more programmable and efficient. For better programmability, the design of CHOPPER (1) exploits bit-slicing compilers to enable automatic memory allocation and code generation, from naturally expressive code (i.e., similar to Parallel Haskell) into "SIMD-Within-A-Register"-style code; and (2) introduces a new abstraction called the "Virtual Code Emitter", which lets Bit-serial SIMD PUD architectures exploit Memory-Level Parallelism (i.e., bank or subarray) more effectively. For better efficiency, we propose three novel optimizations for CHOPPER that better exploit the potential of Bit-serial SIMD PUD architectures; they (1) minimize the amount of intra-subarray data movement; and (2) mitigate the overhead of spilling data outside Bit-serial SIMD PUD architectures. These optimizations can greatly improve the overall efficiency of Bit-serial SIMD PUD architectures. We also discuss (1) the limitations of the current CHOPPER; and (2) the potential of CHOPPER for other types of Processing-In-Memory architectures. We evaluate CHOPPER by hosting it on three state-of-the-art Bit-serial SIMD PUD architectures, and we compare CHOPPER-generated code against state-of-the-art hand-tuned code for these architectures.
We highlight that, averaged across 16 real-world workloads from 4 PUD-friendly application domains, CHOPPER achieves (A) 1.20X, 1.29X and 1.26X speedup when data fit within DRAM subarrays; and (B) 12.61X, 9.05X and 9.81X speedup when data must spill to secondary storage, on Ambit [50], ELP2IM [56] and SIMDRAM [22], compared with hand-tuned code written using the state-of-the-art methodology [22] for Bit-serial SIMD PUD architectures. These performance benefits come with a large reduction in Lines of Code (LoC) with CHOPPER (i.e., 4.3X fewer LoC than hand-tuning a single subarray, and more than 10^3X fewer than hand-tuning all subarrays in a rank). We also perform breakdown and sensitivity studies of CHOPPER, to better understand the sources of its benefits and to examine its robustness under various architectural features.
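To make the bit-serial, bit-sliced execution model concrete, here is a minimal Python sketch of an element-wise addition. It is not CHOPPER's code or output, and the function names (`to_bit_slices`, `from_bit_slices`, `bit_serial_add`) are hypothetical. It assumes each Python integer stands in for one memory row, with bit j of that row belonging to SIMD lane j, so every bitwise operator acts on all lanes at once.

```python
def to_bit_slices(values, width):
    """Transpose integers into `width` rows (LSB first).

    Bit `lane` of rows[i] holds bit i of values[lane].
    """
    rows = []
    for i in range(width):
        row = 0
        for lane, v in enumerate(values):
            row |= ((v >> i) & 1) << lane
        rows.append(row)
    return rows


def from_bit_slices(rows, lanes):
    """Inverse of to_bit_slices: rebuild one integer per lane."""
    values = []
    for lane in range(lanes):
        v = 0
        for i, row in enumerate(rows):
            v |= ((row >> lane) & 1) << i
        values.append(v)
    return values


def bit_serial_add(a_rows, b_rows):
    """Add two bit-sliced operands one bit position at a time.

    Each step uses only row-wide bitwise operations, so all lanes
    advance together. The final carry is dropped, so results wrap
    modulo 2**width.
    """
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):
        out.append(a ^ b ^ carry)                    # sum bit
        carry = (a & b) | (a & carry) | (b & carry)  # majority = carry out
    return out


a = [3, 7, 250, 128]
b = [5, 1, 10, 128]
rows = bit_serial_add(to_bit_slices(a, 8), to_bit_slices(b, 8))
print(from_bit_slices(rows, len(a)))  # [8, 8, 4, 0]
```

In this sketch the loop performs one iteration per bit of the operand width, whatever the number of lanes. That property is what makes very wide rows attractive, and it also shows why a multi-bit operation decomposes into many 1-bit row operations.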
Keywords
Bit-serial SIMD Processing-Using-DRAM architectures, bit-serial SIMD PUD architectures, bit-slicing compilers, CHOPPER, programmable Bit-serial SIMD Processing, ultra-wide SIMD