Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13), 2013, pp. 13-24. Also available as arXiv:1211.6176.

Keywords:
dynamic mid-query replanning, fine-grained fault tolerance property, failures mid-query, complex analytics, rich analytics

Abstract:

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query.
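
To make the "unified engine" idea concrete, the following is a minimal sketch in Spark's Scala RDD API: a SQL-style selection is evaluated over a cached in-memory dataset, and the same dataset then feeds an iterative computation (a toy 1-D k-means), with no hand-off between separate systems. The dataset, query, and algorithm here are illustrative assumptions, not Shark's actual API.

    // A toy, self-contained example (assumed names and data, not Shark's API):
    // a SQL-style selection over an in-memory dataset feeds an iterative
    // computation on the same cached RDD, all inside one engine.
    import org.apache.spark.{SparkConf, SparkContext}

    object UnifiedSqlAndMlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("unified-sketch").setMaster("local[*]"))

        // Stand-in for a warehouse table of (userId, feature) rows; in Shark
        // this would be produced by a SQL query rather than parallelize().
        val rows = sc.parallelize(Seq((1, 2.0), (2, 8.0), (3, 2.5), (4, 7.5)))

        // "SELECT feature FROM rows WHERE feature > 1.0", expressed as RDD ops
        // and cached so the iterative phase below can reuse it cheaply.
        val features = rows.map(_._2).filter(_ > 1.0).cache()

        // Iterative analytics over the same data: a toy 1-D k-means, k = 2.
        var centers = Array(1.0, 9.0)
        for (_ <- 1 to 10) {
          val cs = centers
          val assigned = features.map(x => (cs.minBy(c => math.abs(c - x)), x))
          val updated  = assigned.groupByKey()
            .mapValues(xs => xs.sum / xs.size).collect().toMap
          centers = centers.map(c => updated.getOrElse(c, c))
        }
        println(centers.mkString(", "))
        sc.stop()
      }
    }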

Introduction
  • Modern data analysis faces a confluence of growing challenges. First, data volumes are expanding dramatically, creating the need to scale out across clusters of hundreds of commodity machines.
  • The complexity of data analysis has grown: modern data analysis employs sophisticated statistical methods, such as machine learning algorithms, that go well beyond the roll-up and drill-down capabilities of traditional enterprise data warehouse systems.
  • Despite these increases in scale and complexity, users still expect to be able to query data at interactive speeds.
  • Several classes of systems have emerged to address these challenges; the first, consisting of MapReduce [17] and its generalizations, offers fine-grained fault tolerance but has struggled to deliver interactive query speeds.
Highlights
  • Modern data analysis faces a confluence of growing challenges
  • To optimize SQL queries based on data characteristics even in the presence of analytics functions and UDFs, we extended Spark with Partial DAG Execution (PDE): Shark can reoptimize a running query after running the first few stages of its task DAG, choosing better join strategies or the right degree of parallelism based on observed statistics (see the sketch after this list).
  • To support dynamic query optimization in a distributed setting, we extended Spark to support partial DAG execution (PDE), a technique that allows dynamic alteration of query plans based on data statistics collected at run-time
  • We have presented Shark, a new data warehouse system that combines fast relational queries and complex analytics in a single, fault-tolerant runtime
  • Shark significantly enhances a MapReduce-like runtime to efficiently run SQL, by using existing database techniques and a novel partial DAG execution (PDE) technique that leverages fine-grained data statistics to dynamically reoptimize queries at run-time. This design enables Shark to approach the speedups reported for MPP databases over MapReduce, while providing support for machine learning algorithms, as well as mid-query fault tolerance across both SQL queries and machine learning computations
  • This research represents an important step towards a unified architecture for efficiently combining complex analytics and relational query processing
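
As a minimal sketch of the decision PDE makes at a stage boundary, the plain Scala code below picks a join strategy and a reduce-side degree of parallelism from map-output statistics. The statistics format, thresholds, and names are assumptions for illustration, not Shark's implementation.

    // A toy model of PDE's run-time decision (assumed thresholds and names,
    // not Shark's implementation): after the map stages run, per-partition
    // output sizes drive the join strategy and the number of reduce tasks.
    sealed trait JoinStrategy
    case object BroadcastJoin extends JoinStrategy // ship small side to all nodes
    case object ShuffleJoin   extends JoinStrategy // hash-partition both sides

    final case class StageStats(partitionBytes: Seq[Long]) {
      def totalBytes: Long = partitionBytes.sum
    }

    object PartialDagExecutionSketch {
      val BroadcastThreshold: Long   = 64L << 20  // 64 MB, an assumed tuning knob
      val TargetPartitionBytes: Long = 128L << 20 // aim for ~128 MB per reduce task

      // Choose a join strategy after observing both inputs' map-output sizes.
      def chooseJoin(left: StageStats, right: StageStats): JoinStrategy =
        if (math.min(left.totalBytes, right.totalBytes) <= BroadcastThreshold)
          BroadcastJoin
        else ShuffleJoin

      // Pick the reduce-side degree of parallelism from observed data volume.
      def chooseNumReducers(stats: StageStats, maxReducers: Int): Int =
        math.max(1, math.min(maxReducers,
          (stats.totalBytes / TargetPartitionBytes).toInt + 1))

      def main(args: Array[String]): Unit = {
        val dim  = StageStats(Seq.fill(100)(20L << 20))   // ~2 GB dimension input
        val fact = StageStats(Seq.fill(400)(256L << 20))  // ~100 GB fact input
        println(chooseJoin(dim, fact))                    // too big to broadcast
        println(chooseNumReducers(fact, maxReducers = 800))
      }
    }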
Methods
  • Methodology and Cluster Setup

    Unless otherwise specified, experiments were conducted on Amazon EC2 using 100 m2.4xlarge nodes.
  • For Hadoop MapReduce, the number of map tasks and the number of reduce tasks per node were set to 8, matching the number of cores.
  • The authors discard the first run in order to allow the JVM's just-in-time compiler to optimize common code paths.
  • The authors believe that this more closely mirrors real-world deployments where the JVM will be reused by many queries (see the harness sketch after this list)
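
A minimal sketch of this timing methodology follows: each query is run several times in one JVM, the first (JIT warm-up) run is discarded, and the mean of the remaining runs is reported. The runQuery parameter and the toy workload are hypothetical stand-ins for submitting a warehouse query.

    // A tiny harness matching the methodology above: run each query several
    // times in one JVM, drop the first (JIT warm-up) run, report the mean of
    // the rest. runQuery and the workload below are hypothetical stand-ins.
    object BenchmarkHarnessSketch {
      def time(body: => Any): Double = {
        val start = System.nanoTime()
        body
        (System.nanoTime() - start) / 1e9 // seconds
      }

      def measure(runs: Int)(runQuery: => Any): Double = {
        require(runs >= 2, "need one warm-up run plus at least one measured run")
        val timings  = (1 to runs).map(_ => time(runQuery))
        val measured = timings.drop(1)   // discard the JIT warm-up run
        measured.sum / measured.size     // mean of the remaining runs
      }

      def main(args: Array[String]): Unit = {
        val meanSeconds = measure(runs = 5) {
          // Hypothetical workload standing in for a warehouse query.
          (1 to 1000000).map(i => math.sqrt(i.toDouble)).sum
        }
        println(f"mean query time: $meanSeconds%.3f s (first run discarded)")
      }
    }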
Conclusion
  • Shark significantly enhances a MapReduce-like runtime to efficiently run SQL, by using existing database techniques and a novel partial DAG execution (PDE) technique that leverages fine-grained data statistics to dynamically reoptimize queries at run-time.
  • This design enables Shark to approach the speedups reported for MPP databases over MapReduce, while providing support for machine learning algorithms, as well as mid-query fault tolerance across both SQL queries and machine learning computations.
  • The authors report speedups of 40-100x on real queries, consistent with the speedups previously reported for MPP databases over MapReduce.
Related work
  • To the best of our knowledge, Shark is the only low-latency system that can efficiently combine SQL and machine learning workloads, while supporting fine-grained fault recovery.

    We categorize large-scale data analytics systems into three classes. First, systems like ASTERIX [9], Tenzing [13], SCOPE [12], Cheetah [14], and Hive [34] compile declarative queries into MapReduce-style jobs. Although some of them modify the execution engine they are built on, it is hard for these systems to achieve interactive query response times for reasons discussed in Section 7.

    Second, several projects aim to provide low-latency engines using architectures resembling shared-nothing parallel databases. Such projects include PowerDrill [20] and Impala [1]. These systems do not support fine-grained fault tolerance. In case of mid-query faults, the entire query needs to be re-executed. Google’s Dremel [29] does rerun lost tasks, but it only supports an aggregation tree topology for query execution, and not the more complex shuffle DAGs required for large joins or distributed machine learning.
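
To make this fault-tolerance distinction concrete, the following toy sketch (not any of these systems' actual schedulers) contrasts fine-grained recovery, which re-executes only the deterministic tasks for lost partitions, with coarse-grained recovery, which reruns the entire query after a mid-query failure.

    // A toy contrast (not any of these systems' actual schedulers): with
    // deterministic per-partition tasks, fine-grained recovery reruns only the
    // lost partitions, while coarse-grained recovery reruns the whole query.
    object RecoveryModelsSketch {
      // Returns (query result, total task executions) so the two can be compared.
      def fineGrained(parts: Int, task: Int => Long, lost: Set[Int]): (Long, Int) = {
        val survived = (0 until parts).filterNot(lost).map(task)
        val retried  = lost.toSeq.map(task)      // recompute only lost partitions
        (survived.sum + retried.sum, parts + lost.size)
      }

      def coarseGrained(parts: Int, task: Int => Long, lost: Set[Int]): (Long, Int) = {
        val attempts = if (lost.nonEmpty) 2 else 1 // any failure restarts the query
        ((0 until parts).map(task).sum, attempts * parts)
      }

      def main(args: Array[String]): Unit = {
        val task: Int => Long = p => (1L to 1000L).map(_ * (p + 1)).sum
        println(fineGrained(100, task, lost = Set(7, 42)))   // 102 task executions
        println(coarseGrained(100, task, lost = Set(7, 42))) // 200 task executions
      }
    }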
Funding
  • This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Samsung, Splunk, VMware and Yahoo!, and by a Google PhD Fellowship.
References
  • [1] https://github.com/cloudera/impala.
  • [2] http://hadoop.apache.org/.
  • [3] http://aws.amazon.com/elasticmapreduce/.
  • [4] A. Abouzeid et al. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In VLDB, 2009.
  • [5] S. Agarwal et al. Re-optimizing data-parallel computing. In NSDI, 2012.
  • [6] G. Ananthanarayanan et al. PACMan: Coordinated memory caching for parallel jobs. In NSDI, 2012.
  • [7] R. Avnur and J. M. Hellerstein. Eddies: continuously adaptive query processing. In SIGMOD, 2000.
  • [9] A. Behm et al. ASTERIX: towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185-216, 2011.
  • [10] V. Borkar et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, 2011.
  • [11] Y. Bu et al. HaLoop: efficient iterative data processing on large clusters. Proc. VLDB Endow., 2010.
  • [12] R. Chaiken et al. SCOPE: easy and efficient parallel processing of massive data sets. In VLDB, 2008.
  • [13] B. Chattopadhyay et al. Tenzing: a SQL implementation on the MapReduce framework. PVLDB, 4(12):1318-1327, 2011.
  • [14] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. In VLDB, 2010.
  • [15] C. Chu et al. Map-Reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19:281, 2007.
  • [16] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: new analysis practices for big data. In VLDB, 2009.
  • [17] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
  • [18] X. Feng et al. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, 2012.
  • [19] B. Gufler et al. Handling data skew in MapReduce. In CLOSER, 2011.
  • [20] A. Hall et al. Processing a trillion cells per mouse click. In VLDB, 2012.
  • [21] B. Hindman et al. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, 2011.
  • [22] M. Isard et al. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS, 2007.
  • [23] M. Isard et al. Quincy: Fair scheduling for distributed computing clusters. In SOSP, 2009.
  • [24] M. Isard and Y. Yu. Distributed data-parallel computing using a high-level programming language. In SIGMOD, 2009.
  • [25] N. Kabra and D. J. DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. In SIGMOD, 1998.
  • [26] Y. Kwon et al. SkewTune: mitigating skew in MapReduce applications. In SIGMOD, 2012.
  • [27] Y. Low et al. Distributed GraphLab: a framework for machine learning and data mining in the cloud. In VLDB, 2012.
  • [28] G. Malewicz et al. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
  • [29] S. Melnik et al. Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow., 3:330-339, September 2010.
  • [30] K. Ousterhout et al. The case for tiny tasks in compute clusters. In HotOS, 2013.
  • [31] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
  • [32] M. Stonebraker et al. C-Store: a column-oriented DBMS. In VLDB, 2005.
  • [33] M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 2010.
  • [34] A. Thusoo et al. Hive: a petabyte scale data warehouse using Hadoop. In ICDE, 2010.
  • [35] Transaction Processing Performance Council. TPC Benchmark H.
  • [36] T. Urhan, M. J. Franklin, and L. Amsaleg. Cost-based query scrambling for initial delays. In SIGMOD, 1998.
  • [37] C. Yang et al. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In ICDE, 2010.
  • [38] M. Zaharia et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.
  • [39] M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.