A Machine Learning-Aware Data Re-partitioning Framework for Spatial Datasets
2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022)(2022)
摘要
Spatial datasets are used extensively to train machine learning (ML) models for applications such as spatial regression, classification, clustering, and deep learning. Most of the real-world spatial datasets are often too large, and many spatial ML algorithms represent the geographical region as a grid consisting of several spatial cells. If the granularity of the grid is too fine, that results in a large number of grid cells leading to long training time and high memory consumption issues during the model training. To alleviate this problem, we propose a machine learning-aware spatial data re-partitioning framework that substantially reduces the granularity of the spatial grid. Our spatial data re-partitioning approach combines fine-grained, adjacent spatial cells from a grid into coarser cells prior to training an ML model. During this re-partitioning phase, we keep the information loss within a user-defined threshold without significantly degrading the accuracy of the ML model. According to the empirical evaluation performed on several real-world datasets, the best results achieved by our spatial re-partitioning framework show that we can reduce the data volume and training time by up to 81%, while keeping the difference in prediction or classification error below 5% as compared to a model that is trained on the original input dataset, for most of the ML applications. Our re-partitioned framework also outperforms the state-of-the-art data reduction baselines by 2% to 20% w.r.t. prediction and classification errors.
更多查看译文
关键词
Spatial Machine Learning, Spatial Data, Training, Time Reduction, Training Data Volume Reduction
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要