F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
CoRR (2024)
Abstract
Large language models (LLMs) garner significant attention for their
unprecedented performance, leading to a growing body of research
evaluating LLMs. However, these evaluation benchmarks are limited to
assessing instruction-following capabilities, overlooking the fundamental
abilities that emerge during the pre-training stage. Previous subjective
evaluation methods mainly rely on scoring by API models. However, in the
absence of references, large models have shown limited ability to discern
subtle differences. To bridge this gap, we propose F-Eval, a bilingual
evaluation benchmark for fundamental abilities, including expression,
commonsense, and logic. The tasks in F-Eval include multiple-choice
objective tasks, open-ended objective tasks, reference-based subjective
tasks, and reference-free subjective tasks. For reference-free subjective
tasks, we devise new evaluation methods that serve as alternatives to
scoring by API models. We conduct evaluations on 13 advanced LLMs. Results
show that our evaluation methods achieve higher correlation coefficients
and greater distinction than other evaluators. Additionally, we discuss
the influence of different model sizes, dimensions, and normalization
methods. We anticipate that F-Eval will facilitate the study of LLMs'
fundamental abilities.
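
The abstract judges evaluators by their correlation with human scores and mentions normalization across tasks. As a minimal sketch of how such a comparison is typically computed, the Python snippet below correlates hypothetical automatic scores against human annotations and applies min-max normalization; the score values, the SciPy helpers chosen, and the normalization scheme are illustrative assumptions, not F-Eval's actual data or implementation.

    # Minimal sketch: comparing an automatic evaluator against human judgments
    # via correlation, plus min-max normalization for aggregating task scores.
    # All score values below are fabricated placeholders for illustration.
    from scipy.stats import pearsonr, spearmanr

    # Hypothetical per-sample quality scores.
    human_scores = [4.0, 2.5, 5.0, 3.0, 1.5, 4.5]   # human annotations
    auto_scores  = [4.2, 2.8, 4.8, 3.1, 1.7, 4.4]   # automatic evaluator output

    # Higher correlation with human judgments indicates a better evaluator.
    pearson_r, _  = pearsonr(human_scores, auto_scores)
    spearman_r, _ = spearmanr(human_scores, auto_scores)
    print(f"Pearson r  = {pearson_r:.3f}")
    print(f"Spearman r = {spearman_r:.3f}")

    def min_max_normalize(scores):
        """Rescale one task's scores to [0, 1] so heterogeneous tasks
        (e.g., objective accuracy vs. subjective ratings) can be combined."""
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) for s in scores]

    print(min_max_normalize(auto_scores))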