Effective pruning of web-scale datasets based on complexity of concept clusters
CoRR (2024)
Abstract
Utilizing massive web-scale datasets has led to unprecedented performance
gains in machine learning models, but also imposes outlandish compute
requirements for their training. In order to improve training and data
efficiency, we here push the limits of pruning large-scale multimodal datasets
for training CLIP-style models. Today's most effective pruning method on
ImageNet clusters data samples into separate concepts according to their
embedding and prunes away the most prototypical samples. We scale this approach
to LAION and improve it by noting that the pruning rate should be
concept-specific and adapted to the complexity of the concept. Using a simple
and intuitive complexity measure, we are able to reduce the training cost to a
quarter of regular training. By filtering the LAION dataset, we find that
training on a smaller set of high-quality data can lead to higher performance
with significantly lower training costs. More specifically, we outperform the
LAION-trained OpenCLIP-ViT-B32 model on ImageNet zero-shot accuracy by
1.1 p.p. while using only 27.7% of the data and training compute.
Despite this strong reduction in training cost, we also see improvements on
ImageNet distribution shifts, retrieval tasks, and VTAB. On the DataComp
Medium benchmark, we achieve a new state-of-the-art ImageNet zero-shot
accuracy and a competitive average zero-shot accuracy across 38 evaluation
tasks.
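The pruning idea described above (cluster samples by embedding, estimate each concept's complexity, and prune prototypical samples at a concept-specific rate) can be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: the toy k-means, the mean-distance-to-centroid complexity proxy, and the `base_keep` normalization are all assumptions for exposition.

```python
# Hypothetical sketch of complexity-aware concept pruning (not the paper's code).
import numpy as np

def prune_by_concept_complexity(embeddings, n_clusters=4, base_keep=0.25):
    """Cluster embeddings into concepts, then keep a concept-specific fraction
    of samples: complex clusters keep more, simple clusters fewer. Within a
    cluster, the most prototypical samples (closest to the centroid) are
    pruned first."""
    rng = np.random.default_rng(0)
    # Toy k-means; a real web-scale pipeline would use e.g. faiss on GPUs.
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(10):
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centroids[k] = embeddings[assign == k].mean(axis=0)

    # Complexity proxy (assumed): mean distance of a cluster's samples
    # to its centroid. Tighter clusters count as simpler concepts.
    complexity = np.array([
        d[assign == k, k].mean() if (assign == k).any() else 0.0
        for k in range(n_clusters)
    ])
    # Concept-specific keep rate, normalized so the average rate is base_keep.
    keep_rate = np.clip(base_keep * complexity / complexity.mean(), 0.0, 1.0)

    kept = []
    for k in range(n_clusters):
        idx = np.where(assign == k)[0]
        if len(idx) == 0:
            continue
        # Sort by distance to centroid, descending: least prototypical first,
        # so the most prototypical samples are the ones dropped.
        order = idx[np.argsort(-d[idx, k])]
        n_keep = max(1, int(round(keep_rate[k] * len(idx))))
        kept.extend(order[:n_keep].tolist())
    return np.sort(np.array(kept))

# Usage on random toy embeddings:
emb = np.random.default_rng(1).normal(size=(200, 8))
kept = prune_by_concept_complexity(emb)
print(f"kept {len(kept)} of {len(emb)} samples")
```

With `base_keep=0.25` this targets roughly the "quarter of regular training" data budget mentioned above, while letting individual concepts deviate from that rate according to their complexity.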
Keywords
pruning, large-scale, data curation, concept-based, LAION, DataComp