ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models
arXiv (2024)
Abstract
Large language models (LLMs) have achieved unprecedented performance in
various applications, yet their evaluation remains a critical issue. Existing
hallucination benchmarks are either static or lack adjustable complexity for
thorough analysis. We contend that utilizing existing relational databases is a
promising approach for constructing benchmarks due to their accurate knowledge
description via functional dependencies. We propose ERBench to automatically
convert any relational database into a benchmark based on the
entity-relationship (ER) model. Our key idea is to construct questions using
the database schema, records, and functional dependencies such that they can be
automatically verified. In addition, we use foreign key constraints to join
relations and construct multihop questions, which can be arbitrarily complex
and used to debug the intermediate answers of LLMs. Finally, ERBench supports
continuous evaluation, multimodal questions, and various prompt engineering
techniques. In our experiments, we construct an LLM benchmark using databases
of multiple domains and make an extensive comparison of contemporary LLMs. We
observe that better LLMs like GPT-4 can handle a larger variety of question
types, but are by no means perfect. Also, correct answers do not necessarily
imply correct rationales, which is an important evaluation that ERBench does
better than other benchmarks for various question types. Code is available at
https://github.com/DILAB-KAIST/ERBench.
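The core idea above — generating questions from database records and verifying answers via a functional dependency — can be sketched minimally. This is a hypothetical illustration, not the authors' implementation: it assumes a toy `movie` table with the functional dependency (title, year) → director, so each generated question has a unique, database-verifiable ground-truth answer.

```python
import sqlite3

# Toy relational database (hypothetical schema, not from the paper's code).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INT, director TEXT)")
conn.execute(
    "INSERT INTO movie VALUES ('Inception', 2010, 'Christopher Nolan')"
)

def make_question(title: str, year: int) -> str:
    # The FD (title, year) -> director guarantees a single correct answer.
    return f"Who directed the movie '{title}' released in {year}?"

def verify(title: str, year: int, llm_answer: str) -> bool:
    # Automatic verification: look up the FD-determined attribute and
    # check whether it appears in the LLM's free-form answer.
    row = conn.execute(
        "SELECT director FROM movie WHERE title = ? AND year = ?",
        (title, year),
    ).fetchone()
    return row is not None and row[0].lower() in llm_answer.lower()

question = make_question("Inception", 2010)
print(question)
print(verify("Inception", 2010, "It was directed by Christopher Nolan."))
```

Multi-hop questions would extend this by joining relations along foreign keys (e.g., asking about the director of a movie and then an attribute of that director), letting each intermediate hop be verified the same way.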