PRobELM: Plausibility Ranking Evaluation for Language Models
arXiv (2024)
Abstract
This paper introduces PRobELM (Plausibility Ranking Evaluation for Language
Models), a benchmark designed to assess language models' ability to discern
more plausible from less plausible scenarios through their parametric
knowledge. While benchmarks such as TruthfulQA emphasise factual accuracy or
truthfulness, and others such as COPA explore plausible scenarios without
explicitly incorporating world knowledge, PRobELM seeks to bridge this gap by
evaluating models' capabilities to prioritise plausible scenarios that leverage
world knowledge over less plausible alternatives. This design allows us to
assess the potential of language models for downstream use cases such as
literature-based discovery where the focus is on identifying information that
is likely but not yet known. Our benchmark is constructed from a dataset
curated from Wikidata edit histories, tailored to align with the temporal
bounds of the evaluated models' training data. PRobELM facilitates the evaluation
of language models across multiple prompting types, including statement, text
completion, and question-answering. Experiments with 10 models of varying size
and architecture, examining the relationship between model scale, training
recency, and plausibility performance, reveal that factual accuracy does not
directly correlate with plausibility performance, and that up-to-date training
data enhances plausibility assessment across different model architectures.
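
As a rough illustration of the ranking task described above, the sketch below scores two candidate statements with a causal language model and orders them from more to less plausible. Scoring by average token log-likelihood, the choice of model ("gpt2"), and the example statements are assumptions made for illustration only; they are not taken from PRobELM's actual scoring procedure or data.

```python
# Minimal sketch: ranking statement-style prompts by a causal LM's
# average token log-likelihood (an assumed scoring method, not PRobELM's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM could be substituted

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_log_likelihood(text: str) -> float:
    """Average token log-likelihood of `text` (higher = judged more plausible)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean negative log-likelihood per token, so negate it.
    return -out.loss.item()

# Two hypothetical candidate statements about the same subject; the task is
# to rank the more plausible one above the less plausible one.
candidates = [
    "The newly discovered exoplanet orbits a red dwarf star.",
    "The newly discovered exoplanet orbits a household refrigerator.",
]

ranked = sorted(candidates, key=avg_log_likelihood, reverse=True)
for rank, statement in enumerate(ranked, start=1):
    print(rank, statement)
```

The same candidate fact could equally be cast as a text-completion or question-answering prompt, the other two prompting types the abstract names; only the prompt construction would change, not the ranking step.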