Anchor-based Large Language Models
CoRR (2024)
Abstract
Large language models (LLMs) predominantly employ decoder-only transformer
architectures, necessitating the retention of keys/values information for
historical tokens to provide contextual information and avoid redundant
computation. However, the substantial size and parameter volume of these LLMs
require massive GPU memory. This memory demand increases with the length of the
input text, leading to an urgent need for more efficient methods of information
storage and processing. This study introduces the Anchor-based LLM (AnLLM),
which utilizes an innovative anchor-based self-attention network (AnSAN)
together with an anchor-based inference strategy. This approach enables LLMs to compress
sequence information into an anchor token, reducing the keys/values cache and
enhancing inference efficiency. Experiments show that the AnLLM maintains
comparable accuracy with up to 99% keys/values cache reduction and up to 3.5
times faster inference. Despite a minor compromise in accuracy, the AnLLM
significantly improves computational efficiency and resource utilization,
demonstrating the potential of anchor-based attention for real-time LLM
inference in practical applications.
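
To make the caching idea concrete, below is a minimal single-head sketch of anchor-style KV caching, assuming the anchor is the last token of a finished segment. The names (attend, segment_k, anchor_cache) and the toy shapes are illustrative assumptions, not the paper's implementation.

# Sketch of the anchor-based caching idea from the abstract: after a
# segment is processed, keep only the anchor token's key/value pair in
# the cache instead of every token's. Single-head, toy dimensions;
# treating the last token as the anchor is an assumption here.
import torch

def attend(q, keys, values):
    # Standard scaled dot-product attention for one query vector.
    scores = (keys @ q) / keys.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=0)
    return weights @ values

d = 8                              # toy head dimension
segment_k = torch.randn(5, d)      # keys for a 5-token segment
segment_v = torch.randn(5, d)      # values for that segment

# Full cache: all 5 key/value pairs are retained for future tokens.
full_cache = (segment_k, segment_v)

# Anchor cache: only the anchor token's key/value pair is kept.
# In AnLLM, training with AnSAN drives this token to absorb the
# segment's information; here it simply shrinks the cache 5x.
anchor_cache = (segment_k[-1:], segment_v[-1:])

q_next = torch.randn(d)            # query of the next incoming token
out_full = attend(q_next, *full_cache)
out_anchor = attend(q_next, *anchor_cache)
print(out_full.shape, out_anchor.shape)  # both torch.Size([8])

The memory saving scales with segment length: a cache of one entry per segment replaces one entry per token, which is where the reported keys/values cache reduction comes from.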