FACT: FFN-Attention Co-optimized Transformer Architecture with Eager Correlation Prediction

PROCEEDINGS OF THE 50TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2023)

Abstract
The Transformer model is becoming prevalent in various AI applications thanks to its outstanding performance. However, its high computation cost and memory footprint make inference inefficient. We discover that among the three main computation modules in a Transformer model (QKV generation, attention computation, FFN), it is the QKV generation and FFN that account for most of the power cost. The attention computation, the focus of most previous works, takes a significant share of power only when dealing with extremely long inputs. Therefore, in this paper, we propose FACT, an efficient algorithm-hardware co-design optimizing all three modules of the Transformer. We first propose an eager prediction algorithm that predicts the attention matrix before QKV generation. Using the predicted attention, it further detects unnecessary computation in QKV generation and assigns mixed precision to the FFN, which improves throughput. Further, we propose the FACT accelerator, which efficiently supports eager prediction with three designs. It avoids the large overhead of prediction by using log-based add-only operations. It eliminates the latency of prediction through an out-of-order scheduler that fully pipelines eager prediction with the main computation. It additionally avoids memory access conflicts in the mixed-precision FFN with a novel diagonal storage pattern. Experiments on 22 benchmarks show that FACT improves the throughput of the whole Transformer by 3.59x on geometric mean. It achieves 47.64x and 278.1x energy savings on attention computation compared to ELSA and Sanger, respectively, the previous attention-optimization-only state-of-the-art works. Further, FACT achieves an energy efficiency of 4388 GOPS/W on average when executing the whole Transformer layer, 94.98x higher than an Nvidia V100 GPU.
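The abstract does not spell out how log-based add-only prediction works. The sketch below is a minimal illustration of the general idea, not the paper's actual algorithm: quantize activations and weights to sign plus a power-of-two exponent, so each query-key product reduces to an exponent addition (a shift in hardware), and the cheap scores select which attention entries deserve full-precision work. All function names and the keep_ratio parameter are hypothetical.

import numpy as np

def log_quantize(x):
    # Keep only the sign and the nearest power-of-two exponent of each
    # value, so products can later be formed by exponent addition alone.
    sign = np.sign(x)
    exp = np.round(np.log2(np.abs(x) + 1e-12))
    return sign, exp

def predict_attention_mask(X, Wq, Wk, keep_ratio=0.25):
    # Hypothetical sketch of eager prediction: score every query/key pair
    # with log-domain (add-only) arithmetic, then keep only the strongest
    # fraction of entries per query row.
    sq, eq = log_quantize(X @ Wq)  # a real accelerator would approximate
    sk, ek = log_quantize(X @ Wk)  # these projections cheaply as well
    # q*k is approximated by sign_q * sign_k * 2^(e_q + e_k): the multiply
    # becomes an exponent addition (a shift in hardware, emulated here).
    prod = sq[:, None, :] * sk[None, :, :] * 2.0 ** (eq[:, None, :] + ek[None, :, :])
    scores = prod.sum(axis=-1)                     # (n_tokens, n_tokens)
    k = max(1, int(keep_ratio * scores.shape[1]))
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=1)
    return mask  # True = predicted important; the rest can skip full work

Likewise, the "diagonal storage pattern" is only named, not described. One plausible reading is a classic skewed layout, sketched below, in which element (row, col) lands in bank (row + col) mod num_banks; a whole row and a whole column then each touch every bank at most once, so row-wise and column-wise accesses (as mixed-precision FFN lanes may need) never collide. The bank count of 16 is illustrative.

def bank_and_offset(row, col, num_banks=16):
    # Hypothetical diagonal (skewed) layout: rotate each row's placement
    # by its row index so that neither row-major nor column-major access
    # maps two elements to the same bank in one cycle.
    return (row + col) % num_banks, row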
Keywords
transformer,hardware accelerator,efficient computing,algorithm-hardware co-design,neural network