Automatic Versioning of Time Series Datasets: a FAIR Algorithmic Approach

2022 IEEE 18th International Conference on e-Science (e-Science), 2022

Abstract
As one of the fundamental concepts underpinning the FAIR (Findability, Accessibility, Interoperability, and Reusability) guiding principles, data provenance entails keeping track of each version of a given dataset, from its original to its latest version. However, standard terms to determine and include versioning information in the metadata of a given dataset remain ambiguous and do not explicitly define how to assess the overlap of information between items along a versioning stream. In this work, we propose a novel approach for automatic versioning of time series datasets, based on parameters from two dimensionality reduction approaches, namely Principal Component Analysis and Autoencoders. Specifically, we systematically detect and measure similarities (information distances) between datasets via dimensionality reduction, encode them as different versions, and then automatically generate provenance metadata via a FAIR versioning service using the W3C DCAT 3.0 nomenclature. We illustrate this approach with two time series datasets and demonstrate how the proposed parameters effectively assess the similarity between different data versions. Our results show that the proposed version similarity metrics are robust $(s^{(0,1)}=1)$ to the alteration of up to 60% of cells, the removal of up to 60% of rows, and the log-scale transformation of variables. In contrast, row-wise transformations (e.g., converting absolute values to a percentage of a second variable) yield minimal similarity values $(s^{(0,1)} < 0.75)$. Our code and datasets are openly available to enable reproducibility.
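The paper's similarity metric $s^{(0,1)}$ is defined in the full text; as an illustrative sketch only (not the authors' implementation), the general idea of comparing dimensionality-reduction parameters across dataset versions can be shown by fitting PCA to each version and measuring the absolute cosine similarity between matched principal axes. All function names below are hypothetical.

```python
# Hedged sketch: compare two versions of a dataset by the similarity of
# their PCA parameters (principal axes). This is NOT the paper's s^(0,1)
# metric, only an illustration of the underlying technique.
import numpy as np

def principal_components(X, k=2):
    """Top-k principal axes of a (samples x features) matrix via SVD."""
    Xc = X - X.mean(axis=0)                     # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]                               # shape (k, features), unit-norm rows

def version_similarity(X_a, X_b, k=2):
    """Mean absolute cosine similarity between matched components, in [0, 1]."""
    Va = principal_components(X_a, k)
    Vb = principal_components(X_b, k)
    cos = np.abs(np.sum(Va * Vb, axis=1))       # rows are unit vectors
    return float(cos.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 5))
# A lightly perturbed copy should yield a similarity close to 1,
# mirroring the robustness the abstract reports for small alterations.
noisy = base + rng.normal(scale=0.01, size=base.shape)
print(round(version_similarity(base, noisy), 3))
```

The absolute value handles the sign ambiguity of singular vectors; a production metric would also need to handle reordered or nearly degenerate components, which this sketch does not.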
Keywords
Data Provenance, Dimensionality Reduction, Information Distance, Principal Component Analysis, Findability, Accessibility, Interoperability, Reusability, Open Science, DCAT