Reference-based Metrics Disprove Themselves in Question Generation
arXiv (2024)
Abstract
Reference-based metrics such as BLEU and BERTScore are widely used to
evaluate question generation (QG). In this study, on QG benchmarks such as
SQuAD and HotpotQA, we find that using human-written references cannot
guarantee the effectiveness of the reference-based metrics. Most QG benchmarks
have only one reference; we replicated the annotation process and collected
another reference. A good metric should score a human-validated
question no worse than generated questions. However, the results of
reference-based metrics on our newly collected reference disproved the metrics
themselves. We propose a reference-free metric consisting of multi-dimensional
criteria such as naturalness, answerability, and complexity, built on large
language models. These criteria are not constrained to the syntax or
semantics of a single reference question, and the metric does not require a
diverse set of references. Experiments reveal that our metric accurately
distinguishes between high-quality questions and flawed ones, and achieves
state-of-the-art alignment with human judgment.
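The abstract describes an LLM-based, reference-free metric that rates each generated question along criteria such as naturalness, answerability, and complexity. The sketch below illustrates one way such a metric could be implemented; the criterion names come from the abstract, while the prompt wording, 1-5 score scale, aggregation, and the `llm` callable interface are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a reference-free QG metric driven by an LLM.
# Criteria come from the abstract; prompt text, scale, and aggregation
# are assumptions for illustration only.

CRITERIA = {
    "naturalness": "Is the question fluent and phrased the way a person would ask it?",
    "answerability": "Can the question be answered from the given passage?",
    "complexity": "Does answering require non-trivial reasoning over the passage?",
}


def score_question(llm, passage: str, question: str) -> dict:
    """Rate one generated question on each criterion (1-5) via an LLM.

    `llm` is any callable mapping a prompt string to a completion string,
    e.g. a thin wrapper around a chat model of your choice.
    """
    scores = {}
    for name, description in CRITERIA.items():
        prompt = (
            f"Passage:\n{passage}\n\n"
            f"Question:\n{question}\n\n"
            f"Criterion ({name}): {description}\n"
            "Rate the question on this criterion from 1 (poor) to 5 (excellent). "
            "Reply with a single integer."
        )
        reply = llm(prompt)
        digits = [int(ch) for ch in reply if ch.isdigit()]
        scores[name] = digits[0] if digits else None

    valid = [v for v in scores.values() if v is not None]
    scores["overall"] = sum(valid) / len(valid) if valid else None
    return scores


if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end without any API key.
    demo_llm = lambda prompt: "4"
    print(score_question(demo_llm,
                         "Paris is the capital of France.",
                         "What is the capital of France?"))
```

Because no reference question appears anywhere in the scoring loop, the metric is unaffected by how many references a benchmark provides, which is the property the paper argues reference-based metrics lack.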