Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
arXiv (2024)
Abstract
We evaluate recent Large Language Models (LLMs) on the challenging task of
summarizing short stories, which can be lengthy and include nuanced subtext or
scrambled timelines. Importantly, we work directly with authors to ensure that
the stories have not been shared online (and are therefore unseen by the
models), and to obtain informed evaluations of summary quality using judgments
from the authors themselves. Through quantitative and qualitative analysis
grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We
find that all three models make faithfulness mistakes in over 50% of summaries
and struggle to interpret difficult subtext. However, at their best, the models
can provide thoughtful thematic analysis of stories. We additionally
demonstrate that LLM judgments of summary quality do not match the feedback
from the writers.