Understanding the Training Speedup from Sampling with Approximate Losses
CoRR (2024)
Abstract
It is well known that selecting samples with large losses/gradients can
significantly reduce the number of training steps. However, the selection
overhead is often too high to yield any meaningful gains in terms of overall
training time. In this work, we focus on the greedy approach of selecting
samples with large approximate losses instead of exact losses in order
to reduce the selection overhead. For smooth convex losses, we show that such a
greedy strategy can converge to a constant factor of the minimum value of the
average loss in fewer iterations than the standard approach of random
selection. We also theoretically quantify the effect of the approximation
level. We then develop SIFT, which uses early exiting to obtain approximate
losses from an intermediate layer's representations for sample selection. We
evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT base model
and show significant gains over vanilla training (in terms of training hours and
number of backpropagation steps) without any optimized implementation. For
example, to reach 64% validation accuracy, SIFT with exit at the first layer
takes 43 hours compared to 57 hours of vanilla training.
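
As a rough illustration of the setup the abstract refers to (the notation below is ours, not taken from the paper): the goal is to minimize the average loss over n samples, and the greedy rule repeatedly picks the sample whose approximate loss is currently largest and takes a gradient step on it.

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad i_t \in \arg\max_{1 \le i \le n} \hat{f}_i(w_t), \qquad w_{t+1} = w_t - \eta \nabla f_{i_t}(w_t),$$

where $\hat{f}_i$ is the approximate (e.g., early-exit) loss of sample $i$; the paper's theory quantifies how the gap between $\hat{f}_i$ and $f_i$ (the approximation level) affects convergence to within a constant factor of $\min_w F(w)$.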
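Below is a minimal PyTorch sketch of the idea described above, assuming a generic classifier with an auxiliary early-exit head. The model, the `keep_frac` parameter, and all names (`EarlyExitNet`, `train_step`) are illustrative assumptions, not the authors' SIFT implementation for BERT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    # Toy two-block classifier with a cheap exit head after the first block.
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.exit_head = nn.Linear(hidden, num_classes)   # early-exit head
        self.final_head = nn.Linear(hidden, num_classes)  # full-depth head

    def early_logits(self, x):
        # Approximate predictions from an intermediate representation only.
        return self.exit_head(self.block1(x))

    def forward(self, x):
        return self.final_head(self.block2(self.block1(x)))

def train_step(model, opt, x, y, keep_frac=0.25):
    # 1) Cheap forward pass through the early exit to get approximate losses.
    #    (In practice the exit head itself must also be trained; omitted here.)
    with torch.no_grad():
        approx_loss = F.cross_entropy(model.early_logits(x), y, reduction="none")
    # 2) Greedily keep the samples with the largest approximate losses.
    k = max(1, int(keep_frac * x.size(0)))
    idx = approx_loss.topk(k).indices
    # 3) Full forward/backward only on the selected subset,
    #    so backpropagation cost scales with keep_frac.
    loss = F.cross_entropy(model(x[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = EarlyExitNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
print(train_step(model, opt, x, y))
```

The point of the sketch is the split between a cheap selection pass (early exit, no gradients) and an expensive update pass (full network, selected samples only), which is where the reported savings in backpropagation steps come from.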