Using LLMs for the Extraction and Normalization of Product Attribute Values
CoRR (2024)
Abstract
Product offers on e-commerce websites often consist of a textual product
title and a textual product description. In order to provide features such as
faceted product filtering or content-based product recommendation, the websites
need to extract attribute-value pairs from the unstructured product
descriptions. This paper explores the potential of using large language models
(LLMs), such as OpenAI's GPT-3.5 and GPT-4, to extract and normalize attribute
values from product titles and product descriptions. For our experiments, we
introduce the WDC Product Attribute-Value Extraction (WDC PAVE) dataset. WDC
PAVE consists of product offers from 87 websites that provide schema.org
annotations. The offers belong to five different categories, each featuring a
specific set of attributes. The dataset provides manually verified
attribute-value pairs in two forms: (i) directly extracted values and (ii)
normalized attribute values. The normalization of the attribute values requires
systems to perform the following types of operations: name expansion,
generalization, unit of measurement normalization, and string wrangling. Our
experiments demonstrate that GPT-4 outperforms PLM-based extraction methods by
10%. When both extracting and normalizing product attribute values, GPT-4
achieves a similar performance to the
extraction scenario, while being particularly strong at string wrangling and
name expansion.
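
To make the four normalization operation types named above more concrete, the following is a minimal rule-based sketch in Python. It is not the method evaluated in the paper, which prompts LLMs rather than applying hand-written rules. All function names, lookup tables, and example values are hypothetical and chosen only for illustration; none are taken from WDC PAVE.

```python
import re

# Hypothetical lookup tables: illustrative values only, not from WDC PAVE.
NAME_EXPANSIONS = {"HP": "Hewlett-Packard", "MS": "Microsoft"}
GENERALIZATIONS = {"navy": "blue", "crimson": "red"}
UNIT_FACTORS_TO_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def expand_name(value: str) -> str:
    """Name expansion: map an abbreviation to its full form, if known."""
    cleaned = value.strip()
    return NAME_EXPANSIONS.get(cleaned, cleaned)


def generalize(value: str) -> str:
    """Generalization: map a specific value to a broader category, if known."""
    cleaned = value.strip().lower()
    return GENERALIZATIONS.get(cleaned, cleaned)


def normalize_capacity_gb(value: str) -> float:
    """Unit of measurement normalization: express a capacity in gigabytes."""
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", value, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognized capacity: {value!r}")
    amount = float(match.group(1))
    unit = match.group(2).upper()
    return amount * UNIT_FACTORS_TO_GB[unit]


def wrangle_model_number(value: str) -> str:
    """String wrangling: drop separators and uppercase the remainder."""
    return re.sub(r"[\s\-_/]+", "", value).upper()


if __name__ == "__main__":
    print(expand_name("HP"))
    print(generalize("Navy"))
    print(normalize_capacity_gb("2 TB"))
    print(wrangle_model_number("wd-40 efrx"))
```

Each function corresponds to one operation type. In the setting the paper describes, a single LLM prompt would be expected to produce the normalized value directly from the product title and description, without such per-attribute rules.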