Bridging Lottery Ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?
arXiv (2023)
Abstract
Grokking is one of the most surprising puzzles in neural network generalization: a network first reaches a memorization solution with perfect training accuracy and poor generalization, but with further training, it reaches a perfectly generalized solution. We aim to analyze the mechanism of grokking from the perspective of the lottery ticket hypothesis, identifying the process of finding lottery tickets (good sparse subnetworks) as the key to describing the transitional phase between memorization and generalization. We refer to these subnetworks as "grokking tickets", which are identified via magnitude pruning after perfect generalization. First, using grokking tickets, we show that the lottery tickets drastically accelerate grokking compared to dense networks across various configurations (MLP and Transformer architectures, and arithmetic and image classification tasks). Additionally, to verify that grokking tickets are a more critical factor than weight norms, we compared these "good" subnetworks with dense networks having the same L1 and L2 norms. The results show that the subnetworks generalize faster than the controlled dense models. In further investigations, we discovered that at an appropriate pruning rate, grokking can be achieved even without weight decay. We also show that the speedup does not occur when using tickets identified at the memorization solution or during the transition between memorization and generalization, or when pruning networks at initialization (random pruning, GraSP, SNIP, and SynFlow). The results indicate that the weight norm of the network parameters is not sufficient to explain the process of grokking; rather, finding good subnetworks is what describes the transition from memorization to generalization. The implementation code can be accessed via this link: .
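As a concrete illustration of the ticket-identification procedure described in the abstract, the sketch below shows per-layer magnitude pruning of a model that has already reached the generalized solution, followed by the usual lottery-ticket rewind of the surviving weights to their initialization. This is a minimal PyTorch-style sketch under assumed names (`model`, `init_state`, `prune_rate`), not the authors' released implementation.

```python
# Hypothetical sketch of grokking-ticket extraction via magnitude pruning.
# Assumptions (not from the paper's code): `model` is a trained torch.nn.Module
# at the generalized solution, `init_state` is a dict of its initial parameters.
import torch

def magnitude_masks(model, prune_rate=0.6):
    """Boolean masks keeping the largest-magnitude fraction of each weight matrix."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # leave biases / norm params dense
            continue
        threshold = torch.quantile(param.abs().flatten(), prune_rate)
        masks[name] = param.abs() > threshold    # True = weight survives pruning
    return masks

def rewind_to_ticket(model, masks, init_state):
    """Reset surviving weights to their initialization; zero out pruned weights."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.copy_(init_state[name].to(param.device) * masks[name])
    return model
```

Retraining the rewound sparse network (keeping the masks fixed) is what the abstract compares against training the dense network, and against dense controls matched on L1/L2 norm.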