Elephant, Do Not Forget Everything! Efficient Processing Of Growing Datasets

Joerg Schad, Jorge-Arnulfo Quiane-Ruiz, Jens Dittrich

Cloud Computing (2013)

Abstract

MapReduce has become a popular tool for analysing very large datasets. Nevertheless, users typically have to rerun their MapReduce jobs over the whole dataset every time new records are appended to it. Some researchers have proposed reusing the intermediate data produced by previous MapReduce jobs. However, existing approaches still have to read the whole dataset in order to identify which parts of it changed. Furthermore, storing intermediate results is not suitable in some cases, because it can lead to a very high storage overhead.

In this paper, we propose Itchy, a MapReduce-based system that employs a set of different techniques to deal efficiently with growing datasets. Itchy uses an optimizer to automatically choose the right technique to process a MapReduce job. The beauty of Itchy is that it does not have to read the whole dataset again to deal with new records. In more detail, Itchy keeps track of the provenance of intermediate results in order to selectively recompute intermediate results as required. However, if intermediate results are small or the computational cost of map functions is high, Itchy can automatically switch to storing intermediate results rather than the provenance information. Additionally, Itchy supports directly merging the outputs of several jobs in cases where the MapReduce jobs allow for this kind of processing. We evaluate Itchy using two different benchmarks and compare it with Hadoop and Incoop. The results show the superiority of Itchy over both baseline systems for processing incremental jobs. In terms of job runtime, Itchy is more than one order of magnitude faster than Hadoop (up to ~41 times faster) and Incoop (up to ~11 times faster).
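
The output-merging option mentioned in the abstract can be illustrated with a minimal sketch. This is not Itchy's implementation: it is a toy in-memory word count, and the names `run_job`, `merge_outputs`, `word_count_map`, and `add` are hypothetical helpers introduced here. It assumes the job's reduce step is an associative and commutative combine, which is the kind of job where merging the output of a run over only the appended records with the previous output gives the same result as a full rerun.

```python
from collections import defaultdict
from functools import reduce as fold


def run_job(records, map_fn, combine):
    """Toy in-memory MapReduce: map every record, then fold the values per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: fold(combine, values) for key, values in groups.items()}


def merge_outputs(old_output, delta_output, combine):
    """Merge a previous job output with the output computed over appended records only.

    Assumption: `combine` is associative and commutative, so merging is
    equivalent to recomputing over the whole dataset.
    """
    merged = dict(old_output)
    for key, value in delta_output.items():
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def word_count_map(line):
    # Emit (word, 1) for every whitespace-separated token.
    return [(word, 1) for word in line.split()]


def add(a, b):
    return a + b


base = ["a b a", "b c"]   # dataset seen by the first job
appended = ["a c c"]      # records appended afterwards

base_output = run_job(base, word_count_map, add)
delta_output = run_job(appended, word_count_map, add)   # reads only the new records
incremental = merge_outputs(base_output, delta_output, add)
full = run_job(base + appended, word_count_map, add)    # baseline: full recomputation
```

Here `incremental` and `full` hold the same counts, but the incremental path never rereads `base`. Jobs whose reduce step cannot be combined this way (a median, for example) would need one of the other techniques the abstract describes, such as provenance tracking or stored intermediate results.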

Keywords

different technique, efficient processing, new record, previous mapreduce job, high storage overhead, recompute intermediate result, whole dataset, intermediate result, intermediate data, different benchmarks, mapreduce job, parallel processing, data analysis, merging
