A Cost Model for Spark SQL

IEEE Transactions on Knowledge and Data Engineering (2019)

Abstract
In this paper, we propose a novel cost model for Spark SQL. The cost model covers the class of Generalized Projection, Selection, Join (GPSJ) queries. It takes into account the network and IO costs as well as the most relevant CPU costs. The execution cost is computed starting from a physical plan produced by Spark. The set of operations adopted by Spark when executing a GPSJ query is analytically modeled based on the cluster and application parameters, together with a set of database statistics. Experimental results on three benchmarks and on two clusters of different sizes and with different computation features show that our model can estimate the actual execution time with an average error of about 20 percent. Such accuracy is good enough to let the system choose the most effective plan even when the execution time differences are limited. The error can be reduced to 14 percent if the analytic model is coupled with our straggler handling strategy.
Keywords
Spark, Spark SQL, cost model, query optimization