Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model

arXiv (Cornell University), 2021

Abstract
Recently, large-scale transformer-based models have proven effective across a wide range of tasks and domains. Nevertheless, applying them in industrial production requires tedious and heavy engineering work to reduce inference costs. To fill this gap, we introduce a scalable inference solution, Easy and Efficient Transformer (EET), which includes a series of transformer inference optimizations at both the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager to reduce the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library (Faster Transformer v4.0), EET achieves an average 1.40-4.20x speedup on the transformer decoder layer with an A100 GPU.
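The abstract attributes part of EET's memory savings to a flexible CUDA memory manager. As an illustration only, the sketch below shows one common way such a manager can be structured: a keyed caching allocator that reuses device buffers across decoding steps instead of calling cudaMalloc on every request. The class and function names are hypothetical, not EET's actual API.

```cpp
// Minimal sketch of a reuse-based CUDA buffer manager (illustrative only;
// BufferManager and get() are hypothetical names, not EET's real API).
#include <cuda_runtime.h>
#include <cstdio>
#include <map>
#include <string>

class BufferManager {
public:
    // Return a device buffer of at least `bytes` for the given key,
    // reusing a cached allocation when it is already large enough.
    void* get(const std::string& key, size_t bytes) {
        auto it = cache_.find(key);
        if (it != cache_.end() && it->second.bytes >= bytes) {
            return it->second.ptr;  // reuse the existing allocation
        }
        if (it != cache_.end()) {
            cudaFree(it->second.ptr);  // too small: release and regrow
        }
        void* ptr = nullptr;
        cudaMalloc(&ptr, bytes);
        cache_[key] = {ptr, bytes};
        return ptr;
    }

    ~BufferManager() {
        for (auto& kv : cache_) cudaFree(kv.second.ptr);
    }

private:
    struct Block { void* ptr; size_t bytes; };
    std::map<std::string, Block> cache_;  // one cached block per logical buffer
};

int main() {
    BufferManager mgr;
    // Activation buffers requested at two decoding steps resolve to the
    // same allocation, so the footprint stays flat as generation proceeds.
    float* a = static_cast<float*>(mgr.get("attn_out", 1024 * sizeof(float)));
    float* b = static_cast<float*>(mgr.get("attn_out", 1024 * sizeof(float)));
    printf("reused: %s\n", a == b ? "yes" : "no");
    return 0;
}
```

Keying buffers by their logical role (rather than freeing and reallocating per step) is one plausible way to keep peak device memory bounded during autoregressive decoding, which is the setting where the decoder-layer speedups above are measured.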
Keywords
scalable inference solution, NLP, efficient transformer, model