The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
arXiv (2024)
Abstract
In order to oversee advanced AI systems, it is important to understand their
underlying decision-making process. When prompted, large language models (LLMs)
can provide natural language explanations or reasoning traces that sound
plausible and receive high ratings from human annotators. However, it is
unclear to what extent these explanations are faithful, i.e., truly capture the
factors responsible for the model's predictions. In this work, we introduce
Correlational Explanatory Faithfulness (CEF), a metric that can be used in
faithfulness tests based on input interventions. Previous metrics used in such
tests take into account only binary changes in the predictions. Our metric
accounts for the total shift in the model's predicted label distribution, more
accurately reflecting the explanations' faithfulness. We then introduce the
Correlational Counterfactual Test (CCT) by instantiating CEF on the
Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the
faithfulness of free-text explanations generated by few-shot-prompted LLMs from
the Llama2 family on three NLP tasks. We find that our metric measures aspects
of faithfulness which the CT misses.
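The abstract does not spell out the exact definition of CEF, but the core idea — correlating the total shift in the model's predicted label distribution under an input intervention with whether the explanation mentions the intervened-on factor — can be sketched as follows. The total variation distance, the toy data, and the use of Pearson correlation here are illustrative assumptions, not the paper's exact formulation.

```python
import math

def total_variation(p, q):
    """Total variation distance between two predicted label distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-example data: the model's label distribution before and
# after an input intervention, and whether the free-text explanation
# mentions the intervened-on factor (1) or not (0).
shifts = [total_variation(p, q) for p, q in [
    ([0.9, 0.1], [0.2, 0.8]),    # large distribution shift
    ([0.6, 0.4], [0.55, 0.45]),  # small shift (a binary test would miss it)
    ([0.8, 0.2], [0.3, 0.7]),    # large shift
    ([0.5, 0.5], [0.5, 0.5]),    # no shift
]]
mentioned = [1, 0, 1, 0]

# A faithful explainer should mention factors whose intervention
# substantially moves the label distribution.
print(round(pearson(shifts, mentioned), 3))
```

Unlike a binary test, which only registers whether the argmax label flips, this correlational view credits explanations that track graded changes in the model's confidence.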