DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents
CoRR (2024)
Abstract
Evaluation of large language models (LLMs) has raised great concerns in the
community due to the issue of data contamination. Existing work designed
evaluation protocols using well-defined algorithms for specific tasks, which
cannot be easily extended to diverse scenarios. Moreover, current evaluation
benchmarks can only provide the overall benchmark results and cannot support a
fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we
propose meta probing agents (MPA), a general dynamic evaluation protocol
inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal
2, which naturally extends the previous DyVal. MPA designs
the probing and judging agents to automatically transform an original
evaluation problem into a new one following psychometric theory on three basic
cognitive abilities: language understanding, problem solving, and domain
knowledge. These basic abilities are also dynamically configurable, allowing
multifaceted analysis. We conducted extensive evaluations using MPA and found
that most LLMs perform worse on the transformed problems than on the original
ones, indicating room for improvement. Our
multifaceted analysis demonstrated strong correlations among the basic
abilities, as well as an implicit Matthew effect on model size, i.e., larger
models exhibit stronger correlations among the abilities. MPA can also be used
as a data
augmentation approach to enhance LLMs.