Not all Layers of LLMs are Necessary during Inference
arXiv (2024)
Abstract
The inference phase of Large Language Models (LLMs) is very expensive. An
ideal inference stage of LLMs could utilize fewer computational resources while
still maintaining its capabilities (e.g., generalization and in-context
learning ability). In this paper, we try to answer the question, "During LLM
inference, can we use shallow layers for easy instances; and deep layers for
hard ones?" To answer this question, we first indicate that Not all Layers are
Necessary during Inference by statistically analyzing the activated layers
across tasks. Then, we propose a simple algorithm named AdaInfer to determine
the inference termination moment based on the input instance adaptively. More
importantly, AdaInfer does not alter LLM parameters and maintains
generalizability across tasks. Experiments on well-known LLMs (i.e., Llama2
series and OPT) show that AdaInfer saves an average of 14.8% of computational
resources, even up to 50% on sentiment tasks, while maintaining comparable
performance. Additionally, this method is orthogonal to other model
acceleration techniques, potentially boosting inference efficiency further.
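The core idea described above (stop the forward pass early on easy instances, run all layers on hard ones) can be sketched as a confidence-gated loop over layers. This is a minimal illustrative sketch, not the paper's actual AdaInfer algorithm: the layer functions and the confidence heuristic here are hypothetical stand-ins for a real transformer stack and the paper's statistical stopping classifier.

```python
from typing import Callable, List, Tuple

def adaptive_forward(
    x: List[float],
    layers: List[Callable[[List[float]], List[float]]],
    confidence: Callable[[List[float]], float],
    threshold: float = 0.9,
) -> Tuple[List[float], int]:
    """Run layers sequentially; exit early once confidence crosses threshold.

    Returns the final hidden state and the number of layers actually used.
    """
    for i, layer in enumerate(layers, start=1):
        x = layer(x)
        # Easy instance: a shallow layer already yields a confident state.
        if confidence(x) >= threshold:
            return x, i
    # Hard instance: all layers were needed.
    return x, len(layers)

# Toy demo: each "layer" nudges the state upward; confidence is the max entry.
layers = [lambda v: [min(1.0, a + 0.3) for a in v]] * 6
easy_out, easy_depth = adaptive_forward([0.5], layers, max, threshold=0.9)
hard_out, hard_depth = adaptive_forward([0.0], layers, max, threshold=0.99)
```

In this toy run the "easier" input (a higher starting value) exits after fewer layers than the harder one, mirroring the shallow-layers-for-easy-instances behavior the abstract describes; crucially, no layer weights are modified, only the exit point varies per input.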