NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions
arXiv (2024)
Abstract
Field-Programmable Gate Array (FPGA) accelerators have proven successful in
handling latency- and resource-critical deep neural network (DNN) inference
tasks. Among the most computationally intensive operations in a neural network
(NN) is the dot product between the feature and weight vectors. Thus, some
previous FPGA acceleration works have proposed mapping neurons with quantized
inputs and outputs directly to lookup tables (LUTs) for hardware
implementation. In these works, the boundaries of the neurons coincide with the
boundaries of the LUTs. We propose relaxing these boundaries and mapping entire
sub-networks to a single LUT. As the sub-networks are absorbed within the LUT,
the NN topology and precision within a partition do not affect the size of the
lookup tables generated. Therefore, we utilize fully connected layers with
floating-point precision inside each partition, which benefit from being
universal function approximators, with rigid sparsity and quantization enforced
only between partitions, where the NN topology becomes exposed to the circuit
topology. Although cheap to implement, this approach can lead to very deep NNs,
and so to tackle challenges like vanishing gradients, we also introduce skip
connections inside the partitions. The resulting methodology can be seen as
training DNNs with a specific sparsity pattern that allows them to be mapped to
much shallower circuit-level networks, thereby significantly improving latency.
We validate our proposed method on a known latency-critical task, jet
substructure tagging, and on a classical computer vision task, digit
classification on MNIST. Our approach allows for greater function
expressivity within the LUTs compared to existing work, leading to
lower-latency NNs at the same accuracy.
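As a concrete illustration of the partitioning idea, below is a minimal PyTorch sketch of one sub-network intended to be absorbed into a single K-input LUT. The class name LUTPartition and all layer sizes are illustrative assumptions, not the paper's implementation: inside the partition the layers are dense, float-valued, and skip-connected, while only the low-precision inputs and output are exposed at the partition boundary, where the NN topology meets the circuit topology.

```python
import torch
import torch.nn as nn

class LUTPartition(nn.Module):
    """One sub-network to be absorbed into a single fan_in-input LUT.

    Inside the partition, layers are dense and float-valued; only the
    partition's inputs and output are low-precision, since those are the
    only signals exposed to the circuit. Names and sizes are illustrative.
    """
    def __init__(self, fan_in: int = 4, hidden: int = 16, depth: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(fan_in, hidden)
        # Residual (skip-connected) hidden layers ease training of the
        # deep sub-network that the single LUT will eventually hide.
        self.blocks = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(depth)
        )
        self.out_proj = nn.Linear(hidden, 1)

    def forward(self, x_bits: torch.Tensor) -> torch.Tensor:
        # x_bits: (batch, fan_in) binary inputs in {0, 1}, playing the
        # role of LUT address lines. Float math inside is free in
        # hardware: the whole function is tabulated after training.
        h = torch.relu(self.in_proj(x_bits))
        for layer in self.blocks:
            h = h + torch.relu(layer(h))  # skip connection
        # Hard 1-bit output threshold; during training one would
        # typically use a straight-through-style surrogate gradient.
        return (self.out_proj(h) > 0).float()
```

Because the partition's size in hardware depends only on its fan-in, converting it to a LUT amounts to enumerating all 2^fan_in input patterns and recording the outputs, regardless of how deep or wide the sub-network is internally:

```python
# Tabulate a trained 4-input partition into a 16-entry truth table.
part = LUTPartition(fan_in=4)
addrs = torch.tensor(
    [[(i >> b) & 1 for b in range(4)] for i in range(16)],
    dtype=torch.float32,
)
truth_table = part(addrs).squeeze(1)  # 16 entries -> one 4-input LUT
```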