Scalable Parallel Machine Learning Computing a Summarization Matrix with SQL Queries

2022 IEEE International Conference on Big Data (Big Data)(2022)

Abstract
Multidimensional data summarization is a fundamental mechanism to accelerate the computation of machine learning (ML) models. Relational DBMSs, in turn, can scale beyond main memory limits, evaluate SQL queries in parallel, and hide complex internal system details. With this motivation, we present a wide spectrum of alternative SQL queries to compute a summarization matrix that significantly accelerates the computation of many ML models in a data science language (e.g. Python). We consider two fundamental storage layouts: horizontal and vertical. Our proposed SQL queries lead to diverse query plans, which in turn yield highly different processing times. We identify storage layout (row vs. column) and relational join optimization as two key performance factors. After careful analysis and benchmarking, we recommend two SQL queries that can work across DBMSs. We show that UDFs, an extensibility mechanism, are faster but have many disadvantages compared to plain SQL queries (not portable, system-dependent limitations, main memory limits, manual optimization required). An extensive experimental evaluation shows the pros and cons of our proposed SQL-based solution. Columnar storage provides an order of magnitude performance improvement over row storage. Moreover, SQL queries can match UDF performance on sparse matrices. We show that by exploiting the summarization matrix in Python, the computation of two popular statistical models (Linear Regression and PCA) is much faster than with popular Python libraries (on a single machine) and also faster than with Apache Spark (a parallel, in-memory solution for big data clusters). We also show that our SQL-based solution exhibits linear speedup in parallel processing. In short, the DBMS can act as a backend linear algebra kernel.
Keywords
SQL, Gramian matrix, Linear Algebra, In-database, Query
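
The idea in the abstract can be illustrated with a small sketch. It is not one of the paper's queries: it assumes the summarization matrix is the Gramian of the augmented vector (1, x1, x2, y), uses a hypothetical table X in horizontal layout (one row per data point), and uses SQLite from the Python standard library as a stand-in DBMS. The function names, table name, and toy data are all assumptions made for illustration. A single aggregate query produces the matrix, and Python then fits a linear regression from it without rescanning the data.

```python
import sqlite3


def summarization_matrix(conn, table, columns):
    """Return the Gramian of (1, columns...) computed by one SQL aggregate query."""
    terms = ["1.0"] + list(columns)
    d = len(terms)
    # One SUM(a * b) per matrix entry; SUM(1.0 * 1.0) is the row count n.
    exprs = [f"SUM({a} * {b})" for a in terms for b in terms]
    flat = conn.execute(f"SELECT {', '.join(exprs)} FROM {table}").fetchone()
    return [list(flat[r * d:(r + 1) * d]) for r in range(d)]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(1.0, 2.0, 9.0), (2.0, 1.0, 8.0), (3.0, 4.0, 19.0),
        (4.0, 3.0, 18.0), (5.0, 5.0, 26.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (x1 REAL, x2 REAL, y REAL)")
conn.executemany("INSERT INTO X VALUES (?, ?, ?)", rows)

gamma = summarization_matrix(conn, "X", ["x1", "x2", "y"])

# The upper-left 3x3 block and the last column (first 3 rows) form the
# normal equations of least-squares regression on (1, x1, x2).
lhs = [row[:3] for row in gamma[:3]]
rhs = [row[3] for row in gamma[:3]]
beta = solve(lhs, rhs)  # intercept, coefficient of x1, coefficient of x2
print(beta)
```

The data set is scanned once inside the DBMS; everything after the query works on a 4 x 4 matrix whose size does not depend on the number of rows, which is why the matrix can be reused across models.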