ArcMMLU: A Library and Information Science Benchmark for Large Language Models
CoRR (2023)
Abstract
In light of the rapidly evolving capabilities of large language models
(LLMs), it becomes imperative to develop rigorous domain-specific evaluation
benchmarks to accurately assess their capabilities. In response to this need,
this paper introduces ArcMMLU, a specialized benchmark tailored for the Library
& Information Science (LIS) domain in Chinese. This benchmark aims to measure
the knowledge and reasoning capability of LLMs within four key sub-domains:
Archival Science, Data Science, Library Science, and Information Science.
Following the format of MMLU/CMMLU, we collected over 6,000 high-quality
questions for the compilation of ArcMMLU. This extensive compilation can
reflect the diverse nature of the LIS domain and offer a robust foundation for
LLM evaluation. Our comprehensive evaluation reveals that while most mainstream
LLMs achieve an average accuracy rate above 50% on ArcMMLU, there remains a
notable performance gap, suggesting substantial headroom for refinement in LLM
capabilities within the LIS domain. Further analysis explores the effectiveness
of few-shot examples on model performance and highlights challenging questions
where models consistently underperform, providing valuable insights for
targeted improvements. ArcMMLU fills a critical gap in LLM evaluations within
the Chinese LIS domain and paves the way for future development of LLMs
tailored to this specialized area.