An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets
CoRR (2023)
Abstract
Major advancements in computer vision can primarily be attributed to the use
of labeled datasets. However, acquiring labels for datasets often results in
errors which can harm model performance. Recent works have proposed methods to
automatically identify mislabeled images, but developing strategies to
effectively implement them in real world datasets has been sparsely explored.
Towards improved data-centric methods for cleaning real world vision datasets,
we first conduct more than 200 experiments carefully benchmarking recently
developed automated mislabel detection methods on multiple datasets under a
variety of synthetic and real noise settings with varying noise levels. We
compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that
we craft, and find that SEMD performs similarly to or outperforms prior
mislabel detection approaches. We then apply SEMD to multiple real world
computer vision datasets and test how dataset size, mislabel removal strategy,
and mislabel removal amount further affect model performance after retraining
on the cleaned data. With careful design of the approach, we find that mislabel
removal leads to per-class performance improvements of up to 8% for a retrained
classifier in smaller data regimes.
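
The abstract does not describe how SEMD scores images, so the following is not SEMD. It is only a minimal sketch of the general detect-then-remove workflow, assuming a generic confidence-based score: each image is scored by one minus the probability a classifier assigns to its given label, and a chosen fraction of the highest-scoring images is dropped before retraining. The function names, the `removal_fraction` parameter, and the toy probabilities are all hypothetical.

```python
import numpy as np

def mislabel_scores(probs, labels):
    """Score each sample by 1 - predicted probability of its given label.

    Assumption: `probs` holds out-of-sample class probabilities from some
    classifier; this generic score stands in for any detector's output.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return 1.0 - probs[np.arange(len(labels)), labels]

def keep_mask(scores, removal_fraction):
    """Boolean mask that drops the highest-scoring fraction of samples."""
    scores = np.asarray(scores, dtype=float)
    n_remove = int(round(removal_fraction * len(scores)))
    mask = np.ones(len(scores), dtype=bool)
    if n_remove > 0:
        # argsort is ascending, so the last n_remove indices are the most suspicious
        suspicious = np.argsort(scores, kind="stable")[-n_remove:]
        mask[suspicious] = False
    return mask

# Toy data (hypothetical): four images, two classes; the third label disagrees
# strongly with the classifier's prediction.
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],
    [0.60, 0.40],
])
labels = np.array([0, 1, 1, 0])

scores = mislabel_scores(probs, labels)
mask = keep_mask(scores, removal_fraction=0.25)
cleaned_labels = labels[mask]  # labels kept for retraining
```

Varying `removal_fraction` corresponds to the "mislabel removal amount" the abstract mentions; applying `keep_mask` separately within each class would be one possible alternative removal strategy.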