GraphX: graph processing in a distributed dataflow framework

OSDI, pp. 599-613, 2014.

Cited by: 827|Bibtex|Views113|Links
EI
Keywords:
vertex programResilient Distributed Datasetsgraph processing systemproperty graphgeneral purposeMore(15+)
Wei bo:
For iterative graph algorithms, GraphX is over an order of magnitude faster than directly using the general-purpose dataflow operators described in Section 3.2 and is comparable to or faster than specialized graph processing systems

Abstract:

In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantag...More

Code:

Data:

Introduction
  • The growing scale and importance of graph data has driven the development of numerous specialized graph processing systems including Pregel [22], PowerGraph [13], and many others [7, 9, 37].
  • In contrast to dataflow systems whose operators can span multiple collections, operations in graph processing systems are typically defined with respect to a single property graph with a pre-declared, sparse structure.
  • While this restricted focus facilitates a range of optimizations (Section 2.3), it complicates the expression of analytics tasks that may span multiple graphs and sub-graphs
Highlights
  • The growing scale and importance of graph data has driven the development of numerous specialized graph processing systems including Pregel [22], PowerGraph [13], and many others [7, 9, 37]
  • We demonstrate that, for iterative graph algorithms, GraphX is over an order of magnitude faster than directly using the general-purpose dataflow operators described in Section 3.2 and is comparable to or faster than specialized graph processing systems
  • In this work we introduced GraphX, an efficient graph processing system that enables distributed dataflow frameworks such as Spark to naturally express and efficiently execute iterative graph algorithms
  • We identified a simple pattern of join–map–group-by dataflow operators that forms the basis of graph-parallel computation
  • For graph algorithms, GraphX is over an order of magnitude faster than the base dataflow system and is comparable to or faster than specialized graph processing systems
  • GraphX benefits from features provided by recent dataflow systems such as low-cost fault tolerance and transparent recovery
Results
  • The authors demonstrate that GraphX can achieve performance parity with specialized graph processing systems while preserving the advantages of a general-purpose dataflow framework.
  • In Section 5.1 the authors exploit the optimized partitioning of the sample datasets to achieve up to 56% reduction in runtime and 5.8× reduction in communication compared to a 2D hash partitioning
Conclusion
  • The work on GraphX addressed several key themes in data management systems and system design: Physical Data Independence: GraphX allows the same physical data to be viewed as collections and as graphs without data movement or duplication.
  • The authors identified a simple pattern of join–map–group-by dataflow operators that forms the basis of graph-parallel computation
  • Inspired by this observation, the authors proposed the GraphX abstraction, 2https://spark.apache.org 3For a large-scale commercial use case see [14].
  • Does GraphX support existing graph-parallel abstractions and a wide range of iterative graph algorithms, it enables the composition of graphs and collections, freeing the user to adopt the most natural view without concern for data movement or duplication.
  • GraphX benefits from features provided by recent dataflow systems such as low-cost fault tolerance and transparent recovery
Summary
  • Introduction:

    The growing scale and importance of graph data has driven the development of numerous specialized graph processing systems including Pregel [22], PowerGraph [13], and many others [7, 9, 37].
  • In contrast to dataflow systems whose operators can span multiple collections, operations in graph processing systems are typically defined with respect to a single property graph with a pre-declared, sparse structure.
  • While this restricted focus facilitates a range of optimizations (Section 2.3), it complicates the expression of analytics tasks that may span multiple graphs and sub-graphs
  • Results:

    The authors demonstrate that GraphX can achieve performance parity with specialized graph processing systems while preserving the advantages of a general-purpose dataflow framework.
  • In Section 5.1 the authors exploit the optimized partitioning of the sample datasets to achieve up to 56% reduction in runtime and 5.8× reduction in communication compared to a 2D hash partitioning
  • Conclusion:

    The work on GraphX addressed several key themes in data management systems and system design: Physical Data Independence: GraphX allows the same physical data to be viewed as collections and as graphs without data movement or duplication.
  • The authors identified a simple pattern of join–map–group-by dataflow operators that forms the basis of graph-parallel computation
  • Inspired by this observation, the authors proposed the GraphX abstraction, 2https://spark.apache.org 3For a large-scale commercial use case see [14].
  • Does GraphX support existing graph-parallel abstractions and a wide range of iterative graph algorithms, it enables the composition of graphs and collections, freeing the user to adopt the most natural view without concern for data movement or duplication.
  • GraphX benefits from features provided by recent dataflow systems such as low-cost fault tolerance and transparent recovery
Tables
  • Table1: Graph Datasets. Both graphs have highly skewed power-law degree distributions
Download tables as Excel
Related work
  • In Section 2 we described the general characteristics shared across many of the earlier graph processing systems. However, there are some exceptions to many of these characteristics that are worth noting.

    While most of the work on large-scale distributed graph processing has focused on static graphs, several systems have focused on various forms of stream processing. One

    610 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)

    of the earlier examples is Kineograph [9], a distributed graph processing system that constructs incremental snapshots of the graph for offline static graph analysis. In the multicore setting, GraphChi [17] and later X-Stream [34] introduced support for the addition of edges between existing vertices and between computation stages. Although conceptually GraphX could support the incremental introduction of edges (and potentially vertices), the existing data-structures would require additional optimization. Instead, GraphX focuses on efficiently supporting the removal of edges and vertices: essential functionality for offline sub-graph analysis.
Funding
  • This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adobe, Apple, Inc., Bosch, C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, and Yahoo!. 612 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14)
Reference
  • ABADI, D. J., MARCUS, A., MADDEN, S. R., AND HOLLENBACH, K. SW-Store: A vertically partitioned DBMS for semantic web data management. PVLDB 18, 2 (2009), 385–406.
    Google ScholarLocate open access versionFindings
  • AFRATI, F. N., AND ULLMAN, J. D. Optimizing joins in a map-reduce environment. In EDBT (2010), pp. 99–110.
    Google ScholarLocate open access versionFindings
  • BLANAS, S., PATEL, J. M., ERCEGOVAC, V., RAO, J., SHEKITA, E. J., AND TIAN, Y. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2010), SIGMOD ’10, ACM, pp. 975–986.
    Google ScholarLocate open access versionFindings
  • BOLDI, P., ROSA, M., SANTINI, M., AND VIGNA, S. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW (2011), pp. 587–596.
    Google ScholarLocate open access versionFindings
  • BOLDI, P., AND VIGNA, S. The WebGraph framework I: Compression techniques. In WWW’04.
    Google ScholarLocate open access versionFindings
  • BROEKSTRA, J., KAMPMAN, A., AND HARMELEN, F. V. Sesame: A generic architecture for storing and querying rdf and rdf schema. In Proceedings of the First International Semantic Web Conference on The Semantic Web (2002), ISWC ’02, pp. 54–68.
    Google ScholarLocate open access versionFindings
  • BULUC, A., AND GILBERT, J. R. The combinatorial BLAS: design, implementation, and applications. IJHPCA 25, 4 (2011), 496–509.
    Google ScholarLocate open access versionFindings
  • C ATALYU REK, U. V., AYKANAT, C., AND UC AR, B. On twodimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput. 32, 2 (2010), 656–683.
    Google ScholarLocate open access versionFindings
  • CHENG, R., HONG, J., KYROLA, A., MIAO, Y., WENG, X., WU, M., YANG, F., ZHOU, L., ZHAO, F., AND CHEN, E. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys (2012), pp. 85–98.
    Google ScholarLocate open access versionFindings
  • DEAN, J., AND GHEMAWAT, S. Mapreduce: simplified data processing on large clusters. In OSDI (2004).
    Google ScholarLocate open access versionFindings
  • EWEN, S., TZOUMAS, K., KAUFMANN, M., AND MARKL, V. Spinning fast iterative data flows. Proc. VLDB 5, 11 (July 2012), 1268–1279.
    Google ScholarLocate open access versionFindings
  • FEIGE, U., HAJIAGHAYI, M., AND LEE, J. R. Improved approximation algorithms for minimum-weight vertex separators. In Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing (New York, NY, USA, 2005), STOC ’05, ACM, pp. 563–572.
    Google ScholarLocate open access versionFindings
  • GONZALEZ, J. E., LOW, Y., GU, H., BICKSON, D., AND GUESTRIN, C. Powergraph: Distributed graph-parallel computation on natural graphs. OSDI’12, USENIX Association, pp. 17–30.
    Google ScholarFindings
  • HUANG, A., AND WU, W.
    Google ScholarFindings
  • http://databricks.com/blog/2014/08/14/mining-graph-datawith-spark-at-alibaba-taobao.html, 2014.
    Findings
  • [15] ISARD, M., BUDIU, M., YU, Y., BIRRELL, A., AND FETTERLY, D. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys (2007), pp. 59–72.
    Google ScholarLocate open access versionFindings
  • [16] KARYPIS, G., AND KUMAR, V. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1 (1998), 96–129.
    Google ScholarLocate open access versionFindings
  • [17] KYROLA, A., BLELLOCH, G., AND GUESTRIN, C. GraphChi: Large-scale graph computation on just a PC. In OSDI (2012).
    Google ScholarLocate open access versionFindings
  • [18] LESKOVEC, J., LANG, K. J., DASGUPTA, A.,, AND MAHONEY, M. W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 1 (2008), 29–123.
    Google ScholarLocate open access versionFindings
  • [19] LOW, Y., ET AL. GraphLab: A new parallel framework for machine learning. In UAI (2010), pp. 340–349.
    Google ScholarLocate open access versionFindings
  • [20] LOW, Y., GONZALEZ, J., KYROLA, A., BICKSON, D., GUESTRIN, C., AND HELLERSTEIN, J. M. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB (2012).
    Google ScholarLocate open access versionFindings
  • [21] MACKERT, L. F., AND LOHMAN, G. M. R* optimizer validation and performance evaluation for distributed queries. In VLDB’86 (1986), pp. 149–159.
    Google ScholarLocate open access versionFindings
  • [22] MALEWICZ, G., AUSTERN, M. H., BIK, A. J., DEHNERT, J., HORN, I., LEISER, N., AND CZAJKOWSKI, G. Pregel: a system for large-scale graph processing. In SIGMOD (2010), pp. 135– 146.
    Google ScholarLocate open access versionFindings
  • [23] MANOLA, F., AND MILLER, E. RDF primer. W3C Recommendation 10 (2004), 1–107.
    Google ScholarLocate open access versionFindings
  • [24] MONDAL, J., AND DESHPANDE, A. Managing large dynamic graphs efficiently. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2012), SIGMOD ’12, ACM, pp. 145–156.
    Google ScholarLocate open access versionFindings
  • [25] MURRAY, D. Building new frameworks on Naiad. blog post: http://bigdataatsvc.wordpress.com/2014/04/29/buildingnew-frameworks-for-naiad/, April 2014.
    Findings
  • [26] MURRAY, D. G., MCSHERRY, F., ISAACS, R., ISARD, M., BARHAM, P., AND ABADI, M. Naiad: A timely dataflow system. In SOSP ’13.
    Google ScholarFindings
  • [27] NAJORK, M., FETTERLY, D., HALVERSON, A., KENTHAPADI, K., AND GOLLAPUDI, S. Of hammers and nails: An empirical comparison of three paradigms for processing large graphs. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (2012), WSDM ’12, ACM, pp. 103–112.
    Google ScholarLocate open access versionFindings
  • [28] NEUMANN, T., AND WEIKUM, G. RDF-3X: A RISC-style engine for RDF. VLDB’08.
    Google ScholarFindings
  • [29] OLSTON, C., REED, B., SRIVASTAVA, U., KUMAR, R., AND TOMKINS, A. Pig Latin: A not-so-foreign language for data processing. SIGMOD (2008).
    Google ScholarLocate open access versionFindings
  • [30] PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.
    Google ScholarFindings
  • [31] PRUD’HOMMEAUX, E., AND SEABORNE, A. SPARQL query language for RDF. Latest version available as http://www.w3.org/TR/rdf-sparql-query/, January 2008.
    Findings
  • [32] PUJOL, J. M., ERRAMILLI, V., SIGANOS, G., YANG, X., LAOUTARIS, N., CHHABRA, P., AND RODRIGUEZ, P. The little engine(s) that could: scaling online social networks. In SIGCOMM (2010), pp. 375–386.
    Google ScholarLocate open access versionFindings
  • [33] ROBINSON, I., WEBBER, J., AND EIFREM, E. Graph Databases. O’Reilly Media, Incorporated, 2013.
    Google ScholarFindings
  • [34] ROY, A., MIHAILOVIC, I., AND ZWAENEPOEL, W. X-stream: Edge-centric graph processing using streaming partitions. SOSP ’13, ACM, pp. 472–488.
    Google ScholarFindings
  • [35] SAAD, Y. Iterative Methods for Sparse Linear Systems, 2nd ed. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2003.
    Google ScholarLocate open access versionFindings
  • [36] STANTON, I., AND KLIOT, G. Streaming graph partitioning for large distributed graphs. Tech. Rep. MSR-TR-2011-121, Microsoft Research, November 2011.
    Google ScholarFindings
  • [37] STUTZ, P., BERNSTEIN, A., AND COHEN, W. Signal/collect: graph algorithms for the (semantic) web. In ISWC (2010).
    Google ScholarLocate open access versionFindings
  • [38] UGANDER, J., AND BACKSTROM, L. Balanced label propagation for partitioning massive graphs. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (New York, NY, USA, 2013), WSDM ’13, ACM, pp. 507–516.
    Google ScholarLocate open access versionFindings
  • [39] ZAHARIA, M., CHOWDHURY, M., DAS, T., DAVE, A., MA, J., MCCAULEY, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI’12.
    Google ScholarFindings
Your rating :
0

 

Tags
Comments