Are LLM-based Evaluators Confusing NLG Quality Criteria?
CoRR(2024)
摘要
Some prior work has shown that LLMs perform well in NLG evaluation for
different tasks. However, we discover that LLMs seem to confuse different
evaluation criteria, which reduces their reliability. For further verification,
we first consider avoiding issues of inconsistent conceptualization and vague
expression in existing NLG quality criteria themselves. So we summarize a clear
hierarchical classification system for 11 common aspects with corresponding
different criteria from previous studies involved. Inspired by behavioral
testing, we elaborately design 18 types of aspect-targeted perturbation attacks
for fine-grained analysis of the evaluation behaviors of different LLMs. We
also conduct human annotations beyond the guidance of the classification system
to validate the impact of the perturbations. Our experimental results reveal
confusion issues inherent in LLMs, as well as other noteworthy phenomena, and
necessitate further research and improvements for LLM-based evaluation.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要