PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
CoRR (2024)
Abstract
Large Language Models (LLMs) have exhibited remarkable success in long-form
context comprehension tasks. However, their capacity to generate long-form content,
such as reports and articles, remains insufficiently explored. Current
benchmarks do not adequately assess LLMs' ability to produce informative and
comprehensive content, necessitating a more rigorous evaluation approach. In
this study, we introduce ProxyQA, a framework for evaluating long-form
text generation, comprising in-depth human-curated meta-questions
spanning various domains. Each meta-question contains corresponding
proxy-questions with annotated answers. LLMs are prompted to generate
extensive content in response to these meta-questions. ProxyQA then supplies the
generated content to an evaluator as background context and scores that content
by the evaluator's accuracy in answering the proxy-questions. We examine multiple LLMs,
emphasizing ProxyQA's demanding nature as a high-quality assessment
tool. Human evaluation demonstrates that evaluation through proxy-questions is
highly self-consistent and correlates strongly with human judgment criteria.
The dataset and leaderboard will
be available at .
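For concreteness, the sketch below illustrates the proxy-question evaluation loop described above. It is an assumption-laden illustration, not the authors' implementation: the data schema, the ask_llm helper, and the substring-based answer check are all hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class ProxyQuestion:
    question: str
    reference_answer: str  # human-annotated answer

@dataclass
class MetaQuestion:
    prompt: str                   # open-ended meta-question posed to the LLM
    proxies: list[ProxyQuestion]  # human-curated proxy-questions with answers

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the model under test
    or to the evaluator model; wire in an actual API here."""
    raise NotImplementedError

def proxyqa_score(meta: MetaQuestion) -> float:
    # 1. The model under test writes long-form content for the meta-question.
    generated = ask_llm(meta.prompt)

    # 2. The evaluator answers each proxy-question using only that
    #    generated content as background context.
    correct = 0
    for proxy in meta.proxies:
        evaluator_prompt = (
            f"Context:\n{generated}\n\n"
            f"Based only on the context above, answer: {proxy.question}"
        )
        answer = ask_llm(evaluator_prompt)
        # Simplistic matching against the annotated answer; a real setup
        # would use a more robust comparison.
        if proxy.reference_answer.lower() in answer.lower():
            correct += 1

    # 3. Score the generated content by the fraction of proxy-questions
    #    the evaluator answered correctly.
    return correct / len(meta.proxies)

Under this framing, the generated text is judged indirectly: content that omits key information causes the evaluator to miss the corresponding proxy-questions, which lowers the score.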