Complaint-Driven Training Data Debugging at Interactive Speeds

Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), 2022

Abstract
Modern databases support queries that perform model inference (inference queries). Although powerful and widely used, inference queries are susceptible to incorrect results if the model is biased due to training data errors. Recently, Rain [44] proposed complaint-driven data debugging, which uses user-specified errors in the output of inference queries (complaints) to rank the erroneous training examples that most likely caused the complaint. This can help users better interpret results and debug training sets. Rain combined influence analysis from the ML literature with relaxed query provenance polynomials from the DB literature to approximate the derivative of complaints w.r.t. training examples. Although effective, its runtime is O(|T|·d), where |T| and d are the training set and model sizes, due to its reliance on the model's second-order derivatives (the Hessian). On a Wide ResNet (WRN) model with 1.5 million parameters, it takes over a minute to debug a single complaint. We observe that most complaint debugging costs are independent of the complaint, and that modern models are overparameterized. In response, Rain++ uses precomputation techniques, based on non-trivial insights unique to data debugging, to reduce debugging latencies to a constant independent of model size. We also develop optimizations for when the queried database is known a priori, and for standing queries over streaming databases. Combining these optimizations, Rain++ ensures interactive debugging latencies (~1 ms) on models with millions of parameters.
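To make the cost structure described above concrete, the sketch below illustrates the general influence-function recipe the abstract refers to: ranking training examples by the (approximate) effect they have on a complained-about prediction, with the Hessian-dependent work factored out so that the per-complaint cost is a dot product per training example. This is a minimal, hypothetical illustration using a toy logistic-regression model in NumPy; the model, data, and helper names are assumptions for exposition and are not Rain's or Rain++'s actual implementation.

```python
# Hedged sketch (not the authors' code): influence-style ranking of training
# examples for a complaint, with the complaint-independent part precomputed.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20                        # |T| training examples, d parameters
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)
theta = np.zeros(d)                   # stand-in for a trained parameter vector

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(x, label, theta):
    # Gradient of the logistic loss for a single example w.r.t. theta.
    return (sigmoid(x @ theta) - label) * x

def hessian(X, y, theta, damping=1e-3):
    # Damped Hessian of the average logistic loss over the training set.
    p = sigmoid(X @ theta)
    H = (X * (p * (1 - p))[:, None]).T @ X / len(X)
    return H + damping * np.eye(X.shape[1])

# Complaint-independent precomputation: H^{-1} g_i for every training example.
# This is the O(|T| * d) (plus Hessian) work that can be cached up front.
H_inv = np.linalg.inv(hessian(X, y, theta))
train_grads = np.stack([grad_loss(X[i], y[i], theta) for i in range(n)])
precomp = train_grads @ H_inv

def rank_training_examples(x_complaint, y_complaint):
    """Per-complaint work: one gradient plus a dot product per training example."""
    g_c = grad_loss(x_complaint, y_complaint, theta)
    influences = -precomp @ g_c       # classic influence-function score
    return np.argsort(influences)     # most responsible examples first

# Toy "complaint": the user asserts example 0 should have the opposite label.
print(rank_training_examples(X[0], 1.0 - y[0])[:5])
```

Under these assumptions, everything above `rank_training_examples` depends only on the model and training set, so it can be done once before any complaint arrives; the per-complaint cost is then independent of which complaint is issued, which is the spirit of the precomputation argument in the abstract.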
Keywords
machine learning debugging, data cleaning, data provenance