Statistical Analysis of the Performance of Four Apache Spark ML Algorithms

Journal of Computer Science and Technology (2022)

Abstract
Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for performing classification and other tasks with large datasets. Therefore, knowing both the predictive performance and efficiency of its main algorithms before applying an FS technique is crucial to planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, statistically measuring execution times and predictive power based on the number of attributes from a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Naïve Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The study of the performance of these algorithms is interesting as they are applied in many different problems, such as classification of pathologies from epigenomic data, image classification, and prediction of computer attacks in network security problems.
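As a rough illustration of the kind of statistical comparison the abstract mentions, the following minimal sketch computes a paired Student's t statistic for the execution times of two classifiers. It is not the authors' procedure: the abstract does not say whether the tests were paired, and the function name `paired_t_statistic` and the timing values are hypothetical placeholders invented for the example, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired Student's t test.

    `a` and `b` are equal-length sequences of measurements taken under
    matching conditions (e.g. the same runs or the same attribute counts).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must have equal length of at least 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder timings in seconds, invented purely for illustration.
times_a = [12.1, 11.8, 12.4, 12.0, 11.9]
times_b = [15.3, 15.0, 15.8, 15.1, 15.4]

t, dof = paired_t_statistic(times_a, times_b)
print(f"t = {t:.3f}, degrees of freedom = {dof}")
```

A negative t here indicates that the first sample has the smaller mean. The statistic would still need to be compared against the t distribution to obtain a p-value, and when the differences are not approximately normal, a Wilcoxon signed-rank test (also listed among the keywords) is the usual non-parametric alternative.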
Keywords
big data, machine learning, classification models, Apache Spark, Spark ML, Wilcoxon test, Student’s t test