Gradient Descent Optimizes Normalization-Free ResNets
2023 International Joint Conference on Neural Networks (IJCNN 2023)
Abstract
Recent empirical studies observe that deep residual networks can be trained reliably even without normalization. We call such architectures normalization-free Residual Networks (N-F ResNets); instead of normalization, they add a learnable parameter a that controls the scale of each residual block. However, despite their empirical success, the theoretical understanding of N-F ResNets is still limited. In this paper, we provide the first theoretical analysis of N-F ResNets from two perspectives. First, we prove that the gradient descent (GD) algorithm finds the global minimum of the training loss at a linear rate for over-parameterized N-F ResNets. Second, we prove that N-F ResNets avoid the exploding and vanishing gradient problems when the key parameter a is initialized to a small constant. Notably, we show that the gradients of N-F ResNets are more stable than those of ResNets with Kaiming initialization. Moreover, experiments on benchmark datasets verify our theoretical results.
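To make the described architecture concrete, below is a minimal PyTorch sketch of a normalization-free residual block in which the residual branch is scaled by a learnable scalar a initialized to a small constant, as the abstract describes. The class name, convolutional layout, and the initial value 0.1 are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class NFResidualBlock(nn.Module):
    """Sketch of a normalization-free residual block: the residual branch is
    scaled by a learnable scalar `a` instead of using normalization layers."""

    def __init__(self, channels: int, init_a: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Learnable scale on the residual branch; a small initial value keeps
        # the block close to the identity map at initialization, which is the
        # mechanism the paper credits with stabilizing gradients.
        self.a = nn.Parameter(torch.tensor(init_a))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        return x + self.a * out  # no batch/layer normalization anywhere


# Quick shape check
if __name__ == "__main__":
    block = NFResidualBlock(channels=16, init_a=0.1)
    y = block(torch.randn(2, 16, 32, 32))
    print(y.shape)  # torch.Size([2, 16, 32, 32])
```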