AAN+: Generalized Average Attention Network for Accelerating Neural Transformer

Journal of Artificial Intelligence Research (2022)

Abstract
Transformer benefits from the high parallelization of attention networks in fast training, but it still suffers from slow decoding, partially due to the linear dependency O(m) of the decoder self-attention on previous target words at inference. In this paper, we propose a generalized average attention network (AAN+) that aims to speed up decoding by reducing this dependency from O(m) to O(1). We find that the learned self-attention weights in the decoder follow certain patterns that can be approximated with a dynamic structure. Based on this insight, we develop AAN+, extending our previously proposed average attention (Zhang et al., 2018a, AAN) to support more general position- and content-based attention patterns. AAN+ only requires maintaining a small constant number of hidden states during decoding, ensuring its O(1) dependency. We apply AAN+ as a drop-in replacement for the decoder self-attention and conduct experiments on machine translation (with diverse language pairs), table-to-text generation, and document summarization. With masking tricks and dynamic programming, AAN+ enables Transformer to decode sentences around 20% faster without largely compromising training speed or generation performance. Our results further reveal the importance of localness (neighboring words) in AAN+ and its capability to model long-range dependencies.
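
The O(1) dependency rests on the fact that the uniform average over all previous target words, used by the original AAN, can be maintained incrementally via dynamic programming. The Python sketch below illustrates only that incremental update; the class and variable names (AverageAttentionState, running_mean) are our own illustrative choices, and the gating and feed-forward layers that the model applies on top of the averaged context are omitted.

import numpy as np

class AverageAttentionState:
    # Minimal sketch of the O(1) incremental update behind average attention
    # (Zhang et al., 2018). Names are illustrative, not the paper's implementation.
    def __init__(self, d_model):
        self.step = 0
        self.running_mean = np.zeros(d_model)  # cumulative average of past target representations

    def update(self, y_t):
        # The cumulative average g_t = (1/t) * sum_{k<=t} y_k is rewritten as an
        # incremental update, so decoding keeps one d_model-sized vector
        # instead of all m previous states.
        self.step += 1
        self.running_mean += (y_t - self.running_mean) / self.step
        return self.running_mean

# Toy usage: decode 5 steps with random "embeddings".
state = AverageAttentionState(d_model=4)
for t in range(5):
    g_t = state.update(np.random.randn(4))
print("context vector after 5 steps:", g_t)

Each decoding step therefore touches a single constant-size state rather than all m previous hidden states, which is where the O(m) to O(1) reduction comes from; AAN+ generalizes this by maintaining a small constant number of such states to cover richer position- and content-based attention patterns.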
Keywords
generalized average attention network