High-Efficient Fuzzy Querying With HiveQL for Big Data Warehousing

IEEE Transactions on Fuzzy Systems(2022)

Abstract
Querying and reporting from large volumes of structured, semistructured, and unstructured data often requires some flexibility. Fuzzy sets provide this flexibility by allowing the surrounding world to be categorized in a human-mind-like manner. Apache Hive is a data warehousing framework that runs on top of the Hadoop platform for big data processing. Hive allows executing queries and aggregating and analyzing data stored in the Hadoop Distributed File System (HDFS) and other repositories. Hive responds to the current need for efficient big data warehousing, which is impossible with traditional data warehouses due to their rigid nature. This article presents the FuzzyHive library, which extends the Hive framework with fuzzy-set-based techniques for querying, analyzing, and reporting on big data warehouses. We formalize the fuzzy techniques used while operating on Hive-based data warehouses (including fuzzy filtering on dimensional attributes, projection with fuzzy transformation, fuzzy grouping, and joining). We also show how we embedded these operations in the Hive query language (HiveQL), which has not been studied so far. Such extensions make big data warehousing more flexible and contribute to the portfolio of tools used by the community of people working with fuzzy sets and data analysis. The FuzzyHive library complements the spectrum of available solutions for fuzzy data processing and querying in large datasets. We investigate the performance, effectiveness, and scalability of fuzzy querying in Hive for various data storage formats (text, Avro, and Parquet). Our experiments demonstrate that the proposed extensions introduce more elasticity and are also efficient for big data warehousing; this is the first solution of its kind for this environment.
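To make the idea of fuzzy filtering on a dimensional attribute concrete, the following is a minimal Python sketch, not the FuzzyHive API. The abstract does not give the library's function names, so `Trapezoid`, `fuzzy_filter`, the `age` attribute, the "middle-aged" set, and the 0.5 threshold are all hypothetical choices made for illustration. It assumes a trapezoidal membership function, a common way to model a linguistic term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trapezoid:
    """Hypothetical trapezoidal fuzzy set: support (a, d), core [b, c], a <= b <= c <= d."""
    a: float
    b: float
    c: float
    d: float

    def membership(self, x: float) -> float:
        """Return the degree (0.0 to 1.0) to which x belongs to the set."""
        if self.b <= x <= self.c:
            return 1.0  # inside the core
        if self.a < x < self.b:
            return (x - self.a) / (self.b - self.a)  # rising edge
        if self.c < x < self.d:
            return (self.d - x) / (self.d - self.c)  # falling edge
        return 0.0  # outside the support


def fuzzy_filter(rows, attribute, fuzzy_set, threshold):
    """Keep rows whose membership degree for `attribute` reaches `threshold`.

    Each kept row is returned as a copy with the degree attached under 'degree'.
    """
    selected = []
    for row in rows:
        degree = fuzzy_set.membership(row[attribute])
        if degree >= threshold:
            selected.append({**row, "degree": degree})
    return selected


# Illustrative data only; not taken from the paper.
rows = [
    {"id": 1, "age": 25},
    {"id": 2, "age": 35},
    {"id": 3, "age": 45},
    {"id": 4, "age": 58},
]
middle_aged = Trapezoid(30, 40, 50, 60)  # assumed definition of "middle-aged"
selected = fuzzy_filter(rows, "age", middle_aged, threshold=0.5)
for row in selected:
    print(row)
```

In a Hive setting, a comparable membership test would presumably be exposed as a user-defined function called from a `WHERE` clause; the sketch only shows the selection logic on in-memory rows.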
Keywords
Big data, data warehousing, fuzzy sets, Hadoop, Hive, querying