YinMem: A distributed parallel indexed in-memory computation system for large scale data analytics

2016 IEEE International Conference on Big Data (Big Data)(2016)

引用 8|浏览65
暂无评分
摘要
Machine learning and graph analytics typically process data in an iterative way, reading the same data multiple times and sharing intermediate results across the worker nodes in cluster. Hadoop MapReduce and Spark are two popular open source cluster compute frameworks for large scale data analytics. Apache Spark is currently the state-of-the-art in-memory computation model extending MapReduce by transforming data into RDDs stored in memory. One limitation of Spark, however, lies in the fact that data transformation and distribution is implicitly managed by HDFS. Data locality is not guaranteed for iterative machine learning algorithms which read the same data multiple times. For example, data needed for operations to one worker node might reside in RDDs stored in other worker nodes. The resulting data shuffling becomes a bottleneck when iteratively reading such RDDs. We propose YinMem, a parallel distributed indexed in-memory computation system, bridging the gap between Hadoop ecosystem and HPC by replacing MapReduce with MPI while obtaining the advantage of the distributed data storage. YinMem achieves fair load balancing prior to computation for large sparse matrix by scheduling and distributing indexed data from NoSQL database to the RAM of working nodes. YinMem explores Alluxio as the in-memory storage system and enables efficient data sharing of intermediate results. Preliminary results show that YinMem has achieved 3× speedup to Spark, for computing eigenvalue and eigenvectors of a 16-million scale sparse matrix.
更多
查看译文
关键词
NoSQL database,large sparse matrix,distributed data storage,MPI,HPC,iterative machine learning algorithms,memory computation model,Spark,Hadoop MapReduce,graph analytics,machine learning,large scale data analytics,distributed parallel indexed in-memory computation system,YinMem
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要