Chuweb21D: A Deduped English Document Collection for Web Search Tasks.

SIGIR-AP '23: Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (2023)

Abstract
As a traditional information retrieval task, ad hoc web search has long been an important part of IR research and evaluation tracks (e.g., TREC, NTCIR, and CLEF). A crawled, large-scale web document collection is a central component of offline web search evaluation. Although several English document collections already exist, such as the ClueWeb series, GOV2, and C4, collections that are both recent and available as raw HTML remain relatively scarce. To better support the demands of nascent web search tasks, we have built and publicly released Chuweb21D, a large-scale deduped English document collection for web search tasks. The Chuweb21D collection is derived from Chuweb21, which we released in April 2021 as the target corpus for the NTCIR-16 WWW-4 task. We applied two different deduping thresholds to obtain two versions of Chuweb21D, called Chuweb21D-60 and Chuweb21D-70; the former is used as the target corpus for the ongoing NTCIR-17 FairWeb-1 task. To gain insight into the impact of deduping, we evaluate the runs submitted to the NTCIR-16 WWW-4 task using Chuweb21D and compare the outcome with the official results, which were obtained with the corpus before deduping.