Quality, retrieval and analysis of provenance in large-scale data (2014)

Abstract
Provenance is metadata that describes the lineage of a data product. Lineage is invaluable in advancing the reuse and reproducibility of scientific results in e-Science. Through the availability of provenance, future researchers can make valid assessments of data quality or consider the trustworthiness of the data. The shift towards 'Big Data' has presented challenges in provenance driven by data volume and variety, and the need for making data more valuable and veracious. This dissertation examines provenance quality, capture, and representation particularly for highly voluminous provenance that occurs with growing frequency in large-scale science.

This work has at its core a framework and methodology that identify three dimensions of provenance quality: correctness, completeness, and relevance. Based on the proposed quality dimensions, the framework supports provenance quality analysis at the node/edge, graph, and multi-graph levels, which includes analysis of annotations, timestamps and the structure of provenance traces. A supporting contribution is the design and generation of a pseudo-realistic provenance workload that consists of 48,000 provenance traces, forming a provenance database 10 Gigabytes in size. This workload is composed of provenance from 6 varied realistic workflows and includes a failure model that introduces several types of failures into provenance data including workflow executions that experienced failures and workflow executions that experienced faults in message passing communication between application and provenance system, the latter resulting in dropped provenance.

Provenance in High Performance Computing is directly addressed with the design of a cache storage solution that supports multi-level provenance capture with minimum collection overhead. A distributed NoSQL database stores the collected provenance. Evaluation is carried out through experiments performed on two production systems at the National Energy Research Scientific Computing Center.

The final contribution is in the experimental evaluation of two storage approaches for provenance, graph and relational databases, and the impact on retrieval for provenance specific realistic queries. Results carried out at scale and using real-world provenance traces show that graph databases are better suited for the retrieval of large provenance graphs by ID and relational databases provide a better option for provenance graphs that are of great depth in evaluated scenarios.
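
To make the node/edge-level quality analysis described in the abstract more concrete, the following is a minimal illustrative sketch, not the dissertation's actual framework. It assumes a toy trace representation (the hypothetical `Trace` class, with nodes keyed by ID and edges as source/target pairs) and two hypothetical checks: one for completeness (edges that point at a node whose record was dropped) and one for correctness (edges whose timestamps run backwards).

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Toy provenance trace (hypothetical structure, for illustration only)."""
    nodes: dict = field(default_factory=dict)  # node_id -> timestamp in seconds
    edges: list = field(default_factory=list)  # (source_id, target_id): target derived from source


def dangling_edges(trace):
    """Completeness check: edges that mention a node missing from the trace."""
    return [(s, t) for s, t in trace.edges
            if s not in trace.nodes or t not in trace.nodes]


def out_of_order_edges(trace):
    """Correctness check: edges whose target is timestamped before its source."""
    return [(s, t) for s, t in trace.edges
            if s in trace.nodes and t in trace.nodes
            and trace.nodes[t] < trace.nodes[s]]


# Example trace: the "log" node record was dropped, and "output" carries
# a timestamp earlier than the process that produced it.
trace = Trace(
    nodes={"input": 0.0, "process": 5.0, "output": 3.0},
    edges=[("input", "process"), ("process", "output"), ("process", "log")],
)
print(dangling_edges(trace))      # edges touching a missing node
print(out_of_order_edges(trace))  # edges with inconsistent timestamps
```

In this sketch a dropped provenance message shows up as a dangling edge, which is one plausible way such faults could be detected; the dissertation's own quality checks may differ.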
Keywords
provenance data, provenance database, provenance trace, provenance quality analysis, provenance system, provenance graph, multi-level provenance capture, provenance specific realistic query, large-scale data, large provenance graph, provenance quality