Diagnosing Machine Learning Pipelines with Fine-grained Lineage.
HPDC(2017)
摘要
We present the Hippo system to enable the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets and the cell-level mapping between them. It also collects sufficient information that is needed to reproduce the computation. Hippo efficiently enables common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting the metadata separation and high-order function encoding strategies, we observe an O(10^3)x total improvement in lineage storage efficiency vs. the baseline of cell-wise mapping recording while maintaining the lineage integrity. Hippo can answer the real use case lineage queries within a few seconds, which is low enough to enable interactive diagnosis of ML pipelines.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络