SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications.

Shafaq Siddiqi,Roman Kern,Matthias Boehm

Proceedings of the ACM on Management of Data(2023)

引用 0|浏览8
暂无评分
摘要
In the exploratory data science lifecycle, data scientists often spent the majority of their time finding, integrating, validating and cleaning relevant datasets. Despite recent work on data validation, and numerous error detection and correction algorithms, in practice, data cleaning for ML remains largely a manual, unpleasant, and labor-intensive trial and error process, especially in large-scale, distributed computation. The target ML application---such as classification or regression models---can be used as a signal of valuable feedback though, for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation---which is rather unrealistic---SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over state-of-the-art, and good scalability regarding increasing data sizes and number of evaluated pipelines.
更多
查看译文
关键词
data cleaning pipelines,machine learning,saga,scalable framework
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要