Gradient Descent Optimizes Normalization-Free ResNets
2023 International Joint Conference on Neural Networks (IJCNN 2023)
Abstract
Recent empirical studies observe that deep residual networks can be trained reliably even without normalization. We call such architectures normalization-free Residual Networks (N-F ResNets); instead of normalization, they add a learnable parameter a that controls the scale of each residual block. However, despite their empirical success, the theoretical understanding of N-F ResNets is still limited. In this paper, we provide the first theoretical analysis of N-F ResNets from two perspectives. First, we prove that the gradient descent (GD) algorithm finds the global minimum of the training loss at a linear rate for over-parameterized N-F ResNets. Second, we prove that N-F ResNets avoid the exploding and vanishing gradient problems when the key parameter a is initialized to a small constant. Notably, we show that the gradients of N-F ResNets are more stable than those of ResNets with Kaiming initialization. Moreover, experiments on benchmark datasets verify our theoretical results.
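To make the described architecture concrete, below is a minimal PyTorch sketch of a normalization-free residual block in which the residual branch is scaled by a learnable scalar a initialized to a small constant, as the abstract describes. The class name, convolutional layout, and the initial value 0.1 are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class NFResidualBlock(nn.Module):
    """Sketch of a normalization-free residual block: the residual branch is
    scaled by a learnable scalar `a` instead of using normalization layers."""

    def __init__(self, channels: int, init_a: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Learnable scale on the residual branch; a small initial value keeps
        # the block close to the identity map at initialization, which is the
        # mechanism the paper credits with stabilizing gradients.
        self.a = nn.Parameter(torch.tensor(init_a))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.conv1(x))
        out = self.conv2(out)
        return x + self.a * out  # no batch/layer normalization anywhere


# Quick shape check
if __name__ == "__main__":
    block = NFResidualBlock(channels=16, init_a=0.1)
    y = block(torch.randn(2, 16, 32, 32))
    print(y.shape)  # torch.Size([2, 16, 32, 32])
```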