Provenance–aware workflow for data quality management and improvement for large continuous scientific data streams

user-5ca99f0c530c702a92b1df51(2019)

引用 2|浏览8
暂无评分
摘要
Data quality assessment, management and improvement is an integral part of any big data intensive scientific research to ensure accurate, reliable, and reproducible scientific discoveries. The task of maintaining the quality of data, however, is non-trivial and poses a challenge for a program like the Department of Energy's Atmospheric Radiation Measurement (ARM) that collects data from hundreds of instruments across the world, and distributes thousands of streaming data products that are continuously produced in near-real-time for an archive 1.7 Petabyte in size and growing. In this paper, we present a computational data processing workflow to address the data quality issues via an easy and intuitive web-based portal that allows reporting of any quality issues for any site, facility or instruments at a granularity down to individual variables in the data files. This portal allows instrument specialists and scientists to provide corrective actions in the form of symbolic equations. A parallel processing framework applies the data improvement to a large volume of data in an efficient, parallel environment, while optimizing data transfer and file I/O operations; corrected files are then systematically versioned and archived. A provenance tracking module tracks and records any change made to the data during its entire life cycle which are communicated transparently to the scientific users. Developed in Python using open source technologies, this software architecture enables fast and efficient management and improvement of data in an operational data center environment.
更多
查看译文
关键词
Index Terms-scientific data workflows,data quality,prove- nance,atmospheric science
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要