BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
CoRR (2023)
Abstract
While there are numerous benchmarks comparing the performance of modern
language models (LMs), end-task evaluations often conflate notions of *factual
accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the
sense of correctly reporting implications of beliefs). Our goal is a dataset
that clearly distinguishes these two notions. Our approach is to leverage and
extend a collection of human-annotated *entailment trees*, engineered to
express both good and bad chains of reasoning, and to use a mixture of true and
false facts, in particular including counterfactual examples, to avoid belief
bias (also known as the "content effect"). The resulting dataset, called BaRDa,
contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319
false statements. Testing on four GPT-series models,
GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of
74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This
shows the clear progression of models towards improved factual accuracy and
entailment reasoning, and the dataset provides a new benchmark that more
cleanly separates and quantifies these two notions.
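
To make the two notions concrete, here is a minimal sketch of how one might represent a BaRDa-style example and compute the two scores. The schema (`BardaExample` and its fields) and the `accuracy` helper are illustrative assumptions for exposition, not the paper's actual data format or evaluation code; the toy example shows the key property of the dataset, a valid inference built on false (counterfactual) premises.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BardaExample:
    # Hypothetical schema, not the paper's actual file format.
    premises: List[str]             # supporting statements
    hypothesis: str                 # candidate conclusion
    valid_entailment: bool          # gold label: do the premises entail the hypothesis?
    statement_truth: List[bool]     # gold truth label for each premise, then the hypothesis

def accuracy(gold: List[bool], predicted: List[bool]) -> float:
    """Fraction of model judgments that match the gold labels.

    Applied to entailment-validity labels this gives reasoning accuracy;
    applied to per-statement truth labels it gives factual accuracy.
    """
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Toy usage: counterfactual premises support a *valid* entailment, so a
# model can be scored separately on truth and on reasoning, avoiding
# belief bias (the "content effect").
ex = BardaExample(
    premises=["All birds can fly.", "A penguin is a bird."],
    hypothesis="A penguin can fly.",
    valid_entailment=True,                 # the inference itself is valid...
    statement_truth=[False, True, False],  # ...but two of the statements are false
)
print(accuracy([ex.valid_entailment], [True]))             # reasoning accuracy on one example
print(accuracy(ex.statement_truth, [False, True, True]))   # factual accuracy on its statements
```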