Exploring Learning Complexity for Downstream Data Pruning
CoRR (2024)
Abstract
Over-parameterized pre-trained models pose a great challenge to
fine-tuning with limited computational resources. An intuitive solution is to
prune the less informative samples from the fine-tuning dataset. A series of
training-based scoring functions have been proposed to quantify the
informativeness of a data subset, but their pruning cost becomes
non-negligible due to heavy parameter updates. For efficient pruning, it is
viable to adapt the similarity scoring function of geometric-based methods
from training-based to training-free. However, we empirically show that such
adaptation distorts the original pruning and results in inferior performance
on downstream tasks.
In this paper, we propose to treat the learning complexity (LC) as the scoring
function for classification and regression tasks. Specifically, the learning
complexity is defined as the average predicted confidence of subnets with
different capacities, which encapsulates data processing within a converged
model. We then preserve diverse and easy samples for fine-tuning. Extensive
experiments on vision datasets demonstrate the effectiveness and efficiency
of the proposed scoring function for classification tasks. For instruction
fine-tuning of large language models, our method achieves state-of-the-art
performance with stable convergence, outperforming full training while using
only 10% of the instruction dataset.
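To make the definition concrete, the learning-complexity score can be sketched as the average true-class confidence produced by subnets of different capacities extracted from a converged model. The sketch below is hypothetical: the subnet construction, the `softmax` confidence measure, and the toy logits are assumptions for illustration, not the paper's exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def learning_complexity(subnet_logits, label):
    """Average the true-class confidence across subnets of growing capacity.

    Hypothetical sketch: each entry of subnet_logits is the logit vector a
    subnet (e.g., an early-exit head) produces for one sample. A higher
    average confidence means the sample is "easier" to learn.
    """
    confidences = [softmax(logits)[label] for logits in subnet_logits]
    return sum(confidences) / len(confidences)

# Toy example: three subnets (shallow -> deep) scoring one 4-class sample
# whose true label is class 2. Deeper subnets grow more confident.
easy_logits = [
    [0.2, 0.1, 0.5, 0.2],
    [0.1, 0.0, 1.5, 0.3],
    [0.0, -0.5, 3.0, 0.1],
]
score = learning_complexity(easy_logits, label=2)
```

Under this sketch, samples with high scores (confidently predicted even by low-capacity subnets) are the "easy" ones the method preserves, alongside a diversity criterion, while low-scoring samples are candidates for pruning.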