The Web Data Commons Schema.org Data Set Series

Alexander Brinkmann,Anna Primpeli,Christian Bizer

COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023（2023）

引用 3|浏览18

暂无评分

摘要

Millions of websites have started to annotate structured data within their HTML pages using the schema.org vocabulary. Popular entity types annotated with schema.org terms are products, local businesses, events, and job postings. The Web Data Commons project has been extracting schema.org data from the Common Crawl every year since 2013 and offers the extracted data for public download in the form of the schema.org data set series. The latest release in the series consists of 106 billion RDF quads describing 3.1 billion entities. The entity descriptions originate from 12.8 million different websites. From a Web Science perspective, the data set series lays the foundation for analyzing the adoption process of schema.org annotations on the Web over the past decade. From a machine learning perspective, the annotations provide a large pool of training data for tasks such as product matching, product or job categorization, information extraction, or question answering. This poster gives an overview of the content of the Web Data Commons schema.org data set series. It highlights trends in the adoption of schema.org annotations on the Web and discusses how the annotations are being used as training data for machine learning applications.

查看译文

关键词

information extraction,semantic annotations,schema.org,web science

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要