Organizing Data Lakes for Navigation

SIGMOD/PODS '20: International Conference on Management of Data, Portland, OR, USA, June 2020

Abstract
We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We propose the data lake organization problem as the problem of finding an organization that allows a user to most effectively navigate a data lake. We present a new probabilistic model of how users interact with an organization and propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both a real data lake containing data from open data portals and on a benchmark that contains rich metadata emulating the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users find relevant tables that cannot be found by keyword search.
Keywords
data lakes, dataset discovery and search, structure learning
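
The organization described in the abstract, a navigation graph whose nodes are sets of attributes and whose edges indicate subset relationships, can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's algorithm: the function names `build_organization` and `find_roots` are hypothetical, the example attributes are made up, and linking each node only to its direct subsets (with no other node strictly in between) is an assumption made here to keep the graph small. The probabilistic user model and the approximate algorithm are not reproduced.

```python
def build_organization(attribute_sets):
    """Build a navigation graph as {node: set of child nodes}.

    Each node is a frozenset of attribute names. An edge parent -> child
    is added when child is a proper subset of parent and no other node
    lies strictly between them (an assumption of this sketch).
    """
    nodes = {frozenset(s) for s in attribute_sets}
    edges = {node: set() for node in nodes}
    for parent in nodes:
        for child in nodes:
            if child < parent:  # proper subset
                has_intermediate = any(child < mid < parent for mid in nodes)
                if not has_intermediate:
                    edges[parent].add(child)
    return edges


def find_roots(edges):
    """Return the nodes that are not a child of any other node."""
    all_children = set().union(*edges.values()) if edges else set()
    return {node for node in edges if node not in all_children}


# Hypothetical attribute sets, for illustration only.
example_sets = [
    {"city", "population", "budget", "school"},
    {"city", "population"},
    {"budget", "school"},
    {"city"},
]
organization = build_organization(example_sets)
entry_points = find_roots(organization)
```

In this sketch a user would start at a node in `entry_points` and move along edges toward smaller, more specific attribute sets.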