Hydragen: High-Throughput LLM Inference with Shared Prefixes
CoRR (2024)
Abstract
Transformer-based large language models (LLMs) are now deployed to hundreds
of millions of users. LLM inference is commonly performed on batches of
sequences that share a prefix, such as few-shot examples or a chatbot system
prompt. Decoding in this large-batch setting can be bottlenecked by the
attention operation, which reads large key-value (KV) caches from memory and
computes inefficient matrix-vector products for every sequence in the batch. In
this work, we introduce Hydragen, a hardware-aware exact implementation of
attention with shared prefixes. Hydragen computes attention over the shared
prefix and unique suffixes separately. This decomposition enables efficient
prefix attention by batching queries together across sequences, reducing
redundant memory reads and enabling the use of hardware-friendly matrix
multiplications. Our method can improve end-to-end LLM throughput by up to 32x
against competitive baselines, with speedup growing with the batch size and
shared prefix length. Hydragen also enables the use of very long shared
contexts: with a high batch size, increasing the prefix length from 1K to 16K
tokens decreases Hydragen throughput by less than 15%, while the throughput of
baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix
decomposition and can be applied to tree-based prompt sharing patterns,
allowing us to further reduce inference time on competitive programming
problems by 55%.
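
The prefix/suffix decomposition described above can be illustrated with a small sketch. It is not the Hydragen implementation: it is a single-head, pure-Python toy, and all function and variable names (`attention_with_lse`, `merge_partials`, `decomposed_attention`, `full_attention`) are hypothetical. The abstract does not spell out how the two partial results are recombined; the sketch assumes the standard approach of rescaling each partial softmax output by its log-sum-exp, which reproduces attention over the concatenated sequence exactly.

```python
import math
import random


def attention_with_lse(q, keys, values):
    """Softmax attention for one query vector over lists of key/value vectors.

    Returns the attention output and the log-sum-exp (LSE) of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom for j in range(dim)]
    return out, top + math.log(denom)


def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs computed over disjoint key sets."""
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]


def decomposed_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """Attend to the shared prefix and each unique suffix separately, then merge."""
    # Prefix pass: every query attends to the same prefix keys/values. In a real
    # kernel this would be one matrix-matrix product over all queries in the batch.
    prefix_parts = [attention_with_lse(q, prefix_k, prefix_v) for q in queries]
    results = []
    for q, (p_out, p_lse), s_k, s_v in zip(queries, prefix_parts, suffix_ks, suffix_vs):
        # Suffix pass: each query attends only to its own sequence's suffix.
        s_out, s_lse = attention_with_lse(q, s_k, s_v)
        results.append(merge_partials(p_out, p_lse, s_out, s_lse))
    return results


def full_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """Reference: attention over the concatenated prefix + suffix per sequence."""
    return [
        attention_with_lse(q, prefix_k + s_k, prefix_v + s_v)[0]
        for q, s_k, s_v in zip(queries, suffix_ks, suffix_vs)
    ]


def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
dim, batch, prefix_len, suffix_len = 8, 4, 32, 5
queries = random_vectors(rng, batch, dim)
prefix_k = random_vectors(rng, prefix_len, dim)
prefix_v = random_vectors(rng, prefix_len, dim)
suffix_ks = [random_vectors(rng, suffix_len, dim) for _ in range(batch)]
suffix_vs = [random_vectors(rng, suffix_len, dim) for _ in range(batch)]

decomposed = decomposed_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
reference = full_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
max_diff = max(
    abs(a - b)
    for row_d, row_r in zip(decomposed, reference)
    for a, b in zip(row_d, row_r)
)
print(f"max abs difference vs. full attention: {max_diff:.2e}")
```

The toy loops over queries one at a time, so it shows only that the decomposition is exact, not the speedup. The throughput gain described in the abstract comes from stacking the batch's queries so the prefix pass reads the shared KV cache once and runs as a matrix-matrix multiplication.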