Hybrid Crowd-Machine Wrapper Inference

ACM Transactions on Knowledge Discovery from Data (TKDD) (2019)

Abstract
Wrapper inference deals with generating programs that extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On the one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy the underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make it possible to produce large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers incurs monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches by exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, human workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.
Keywords
Crowdsourcing, data extraction, wrapper inference
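
The recruiting rule described at the end of the abstract can be illustrated with a minimal sketch. This is not the paper's model: the abstract does not say how wrapper quality is estimated or how the decision is made, so the `Candidate` type, the `choose_action` function, the example XPath rules, and the fixed 0.9 threshold are all assumptions made for illustration. The sketch only shows the general idea of falling back to crowd workers when no automatically inferred wrapper looks reliable enough.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate wrapper produced by an unsupervised algorithm (hypothetical type)."""
    rule: str           # illustrative extraction rule, e.g. an XPath expression
    probability: float  # estimated probability that the rule is correct (assumed given)


def choose_action(candidates, threshold=0.9):
    """Accept the best candidate if it is confident enough, otherwise ask the crowd.

    The threshold value is an assumption, not a figure from the paper.
    Returns a pair (action, rule) where action is "accept" or "ask_crowd".
    """
    if not candidates:
        raise ValueError("at least one candidate wrapper is required")
    # Pick the candidate with the highest estimated probability of being correct.
    best = max(candidates, key=lambda c: c.probability)
    if best.probability >= threshold:
        return ("accept", best.rule)
    # No candidate is reliable enough: pay for human input on the best guess.
    return ("ask_crowd", best.rule)


if __name__ == "__main__":
    confident = [Candidate("//td[1]", 0.95), Candidate("//td[2]", 0.05)]
    uncertain = [Candidate("//td[1]", 0.55), Candidate("//td[2]", 0.45)]
    print(choose_action(confident))
    print(choose_action(uncertain))
```

Under these assumptions, crowd cost is incurred only for sources where the unsupervised estimate stays below the threshold, which is the trade-off between accuracy and monetary cost that the abstract describes.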