Automatic Caching Decision for Scientific Dataflow Execution in Apache Spark

BeyondMR@SIGMOD (2018)

Citations: 17 | Views: 74
Abstract
Demands for large-scale data analysis and processing have led to the development and widespread adoption of computing frameworks that leverage in-memory data processing, largely outperforming disk-based processing systems. One such framework is Apache Spark, which adopts a lazy-evaluation execution model. In this model, the execution of a transformation dataflow operation is delayed until its results are required by an action. Furthermore, a transformation's results are not kept in memory by default, so the same transformation must be re-executed whenever it is required by another action. To avoid unnecessary re-execution of entire pipelines of frequently referenced operations, Spark lets the programmer explicitly define a cache operation that persists transformation results. However, many factors affect the efficiency of a cache in a dataflow, including the existence of other cache operations. Thus, even with a reasonably small number of transformations, choosing the optimal combination of cache operations is a nontrivial problem. The problem is highlighted by the fact that intuitive strategies -- especially when considered in isolation -- may actually harm dataflow efficiency. In this work, we present an automatic procedure to compute a substantially optimal combination of cache operations given a dataflow definition and a simple cost model for the operations. Our results on an astronomy dataflow use case show that our algorithm is resilient to changes in the dataflow and cost model, and that it outperforms intuitive strategies, consistently deciding on a substantially optimal combination of caches.
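The cache-placement trade-off the abstract describes can be illustrated with a small brute-force sketch. Everything here is an illustrative assumption, not the paper's method: a linear three-stage dataflow, made-up per-stage compute costs and action counts, and a fixed per-cache overhead. Note how, with these numbers, neither "cache nothing" nor "cache everything" is optimal:

```python
from itertools import combinations

# Toy linear dataflow t0 -> t1 -> t2 (hypothetical numbers, not from the paper).
COSTS = [5.0, 3.0, 8.0]   # assumed compute cost of each transformation
USES = [1, 3, 2]          # assumed number of actions reading each stage's result
CACHE_OVERHEAD = 6.0      # assumed fixed cost of persisting one result


def plan_cost(cached):
    """Total cost of running all actions, given the set of cached stages."""

    def recompute(i):
        # Walk the lineage upward; a cached ancestor stops the walk,
        # mimicking Spark truncating lineage at a persisted RDD.
        c, j = 0.0, i
        while j >= 0:
            c += COSTS[j]
            if j - 1 in cached:
                break
            j -= 1
        return c

    total = 0.0
    for i, n_actions in enumerate(USES):
        if i in cached:
            # Materialized once, then every action reads it from the cache.
            total += recompute(i) + CACHE_OVERHEAD
        else:
            # Every action re-runs the lineage down to a cached ancestor.
            total += n_actions * recompute(i)
    return total


def best_plan():
    """Exhaustively try every subset of stages to cache; return the cheapest."""
    stages = range(len(COSTS))
    plans = (frozenset(c) for r in range(len(COSTS) + 1)
             for c in combinations(stages, r))
    return min(plans, key=plan_cost)


if __name__ == "__main__":
    print("no cache:", plan_cost(frozenset()))
    print("best:", sorted(best_plan()), plan_cost(best_plan()))
```

With these numbers the exhaustive search picks caching stages 1 and 2 but not stage 0, cheaper than both the no-cache and the cache-everything plans. Exhaustive search is exponential in the number of transformations, which is precisely why an automatic decision procedure such as the one the paper proposes is of interest.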