
How Do Nonlinear Transformers Learn and Generalize in In-Context Learning?

ICML 2024

Abstract
Transformer-based large language models have displayed impressive in-context learning (ICL) capabilities, where a pre-trained model can handle new tasks without fine-tuning by simply augmenting the query with some input-output examples from that task. Despite this empirical success, the mechanics of how to train a Transformer to achieve ICL, and the resulting ICL capacity, remain largely elusive due to the technical challenges of analyzing the nonconvex training problems that arise from the nonlinear self-attention and nonlinear activation in Transformers. To the best of our knowledge, this paper provides the first theoretical analysis of the training dynamics of Transformers with nonlinear self-attention and a nonlinear MLP, together with the ICL generalization capability of the resulting model. Focusing on a group of binary classification tasks, we train Transformers on data from a subset of these tasks and quantify the impact of various factors on ICL generalization performance on the remaining unseen tasks, with and without data distribution shifts. We also analyze how different components of the learned Transformers contribute to ICL performance. Furthermore, we provide the first theoretical analysis of how model pruning affects ICL performance and prove that proper magnitude-based pruning can have a minimal impact on ICL while reducing inference costs. These theoretical findings are supported by numerical experiments.
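The in-context learning setup described above (augmenting a query with input-output demonstrations, with no parameter updates) can be illustrated with a minimal sketch. The prompt format below is hypothetical, for illustration only; the paper studies this setup theoretically for binary classification rather than prescribing a text encoding.

```python
def build_icl_prompt(examples, query):
    """Build an in-context prompt: labeled demonstrations followed by a query.

    `examples` is a list of (input, label) pairs from the new task;
    the model is expected to complete the label for `query` without
    any fine-tuning. The "Input:/Label:" format is an assumption.
    """
    demos = "\n".join(f"Input: {x} -> Label: {y}" for x, y in examples)
    return f"{demos}\nInput: {query} -> Label:"
```

For a binary classification task with labels +1/-1, `build_icl_prompt([("3", "+1"), ("-2", "-1")], "5")` yields a prompt whose demonstrations define the task purely through context.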
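Magnitude-based pruning, whose effect on ICL the abstract says the paper analyzes, can be sketched as follows. This is an illustrative implementation that zeroes the smallest-magnitude entries of a single weight matrix; the paper's analysis concerns pruning trained Transformer components, not this exact routine.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of entries.

    A generic magnitude-pruning sketch: entries whose absolute value
    falls at or below the k-th smallest magnitude are set to zero,
    reducing the number of nonzero weights used at inference.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)        # number of entries to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)
```

The design choice here (a global magnitude threshold) is the simplest variant; per-layer or per-neuron thresholds are common alternatives.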