Text classification of column headers with a controlled vocabulary: leveraging LLMs for metadata enrichment
CoRR (2024)
Abstract
Traditional dataset retrieval systems index on metadata information rather
than on the data values. Such systems therefore rely primarily on manual
annotations and high-quality metadata, both of which are known to be
labour-intensive to produce and challenging to automate. We propose a method
to support metadata enrichment with topic
annotations of column headers using three Large Language Models (LLMs):
ChatGPT-3.5, GoogleBard and GoogleGemini. We investigate the LLMs' ability to
classify column headers based on domain-specific topics from a controlled
vocabulary. We evaluate our approach by assessing the internal consistency of
the LLMs, the inter-machine alignment, and the human-machine agreement for the
topic classification task. Additionally, we investigate the impact of
contextual information (i.e. dataset description) on the classification
outcomes. Our results suggest that ChatGPT and GoogleGemini outperform
GoogleBard in both internal consistency and LLM-human alignment.
Interestingly, we found that context had no impact on the LLMs' performance.
This work proposes a novel approach that leverages LLMs for text classification
using a controlled topic vocabulary, which has the potential to facilitate
automated metadata enrichment, thereby enhancing dataset retrieval and the
Findability, Accessibility, Interoperability and Reusability (FAIR) of research
data on the Web.
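
The three evaluation angles named above (internal consistency, inter-machine alignment, human-machine agreement) can be illustrated with a minimal sketch. The abstract does not state which metrics were used, so the choices here are assumptions: internal consistency is taken as the share of column headers that receive the same topic in every repeated run, and agreement between two annotators (two LLMs, or an LLM and a human) is measured with Cohen's kappa. The column headers, topic labels, and function names are hypothetical and serve only as illustration.

```python
from collections import Counter


def internal_consistency(runs):
    """Share of headers given the same topic in every repeated run.

    `runs` is a list of dicts mapping column header -> topic label,
    one dict per repeated run of the same model (assumed setup).
    """
    headers = list(runs[0])
    stable = sum(1 for h in headers if len({run[h] for run in runs}) == 1)
    return stable / len(headers)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labelling the same headers."""
    headers = list(labels_a)
    n = len(headers)
    # Observed agreement: fraction of headers with identical labels.
    observed = sum(labels_a[h] == labels_b[h] for h in headers) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a = Counter(labels_a[h] for h in headers)
    count_b = Counter(labels_b[h] for h in headers)
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical example: two runs of one model and one human annotation.
llm_runs = [
    {"birth_year": "Demography", "income": "Economics", "station_id": "Environment"},
    {"birth_year": "Demography", "income": "Economics", "station_id": "Transport"},
]
human = {"birth_year": "Demography", "income": "Economics", "station_id": "Environment"}

print(internal_consistency(llm_runs))
print(cohen_kappa(llm_runs[0], human))
print(cohen_kappa(llm_runs[1], human))
```

The same `cohen_kappa` function covers both inter-machine alignment (pass the labels of two different models) and human-machine agreement (pass one model's labels and the human labels). A comparison with and without the dataset description in the prompt would reuse these functions on the two sets of outputs.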