Not all Layers of LLMs are Necessary during Inference
arXiv (2024)
Abstract
The inference phase of Large Language Models (LLMs) is very expensive. An
ideal inference stage of LLMs could utilize fewer computational resources while
still maintaining its capabilities (e.g., generalization and in-context
learning ability). In this paper, we try to answer the question, "During LLM
inference, can we use shallow layers for easy instances; and deep layers for
hard ones?" To answer this question, we first indicate that Not all Layers are
Necessary during Inference by statistically analyzing the activated layers
across tasks. Then, we propose a simple algorithm named AdaInfer to determine
the inference termination moment based on the input instance adaptively. More
importantly, AdaInfer does not alter LLM parameters and maintains
generalizability across tasks. Experiments on well-known LLMs (i.e., Llama2
series and OPT) show that AdaInfer saves an average of 14.8% of computational
resources, even up to 50% on sentiment tasks, while maintaining comparable
performance. Additionally, this method is orthogonal to other model
acceleration techniques, potentially boosting inference efficiency further.
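The core idea described above (stop the forward pass early on easy instances, run all layers on hard ones) can be sketched as a confidence-gated loop over layers. This is a minimal illustrative sketch, not the paper's actual AdaInfer algorithm: the layer functions and the confidence heuristic here are hypothetical stand-ins for a real transformer stack and the paper's statistical stopping classifier.

```python
from typing import Callable, List, Tuple

def adaptive_forward(
    x: List[float],
    layers: List[Callable[[List[float]], List[float]]],
    confidence: Callable[[List[float]], float],
    threshold: float = 0.9,
) -> Tuple[List[float], int]:
    """Run layers sequentially; exit early once confidence crosses threshold.

    Returns the final hidden state and the number of layers actually used.
    """
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        # Easy instance: a shallow layer already yields a confident state.
        if confidence(x) >= threshold:
            return x, i
    # Hard instance: all layers were needed.
    return x, len(layers)

# Toy demo: each "layer" nudges the state upward; confidence is the max entry.
layers = [lambda v: [min(1.0, a + 0.3) for a in v]] * 6
easy_out, easy_depth = adaptive_forward([0.5], layers, max, threshold=0.9)
hard_out, hard_depth = adaptive_forward([0.0], layers, max, threshold=0.99)
```

In this toy run the "easier" input (a higher starting value) exits after fewer layers than the harder one, mirroring the shallow-layers-for-easy-instances behavior the abstract describes; crucially, no layer weights are modified, only the exit point varies per input.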