Towards Uncovering How Large Language Model Works: An Explainability Perspective
arXiv (2024)
Abstract
Large language models (LLMs) have led to breakthroughs in language tasks, yet
the internal mechanisms that enable their remarkable generalization and
reasoning abilities remain opaque. This lack of transparency presents
challenges such as hallucinations, toxicity, and misalignment with human
values, hindering the safe and beneficial deployment of LLMs. This paper aims
to uncover the mechanisms underlying LLM functionality through the lens of
explainability. First, we review how knowledge is architecturally composed
within LLMs and encoded in their internal parameters via mechanistic
interpretability techniques. Then, we summarize how knowledge is embedded in
LLM representations by leveraging probing techniques and representation
engineering. Additionally, we investigate training dynamics from a
mechanistic perspective to explain phenomena such as grokking and memorization.
Lastly, we explore how the insights gained from these explanations can enhance
LLM performance through model editing, improve efficiency through pruning, and
better align LLMs with human values.