Can Large Language Models do Analytical Reasoning?
arXiv (2024)
Abstract
This paper explores how well cutting-edge Large Language Models perform
analytical reasoning on sports data. Our analytical reasoning task asks large
language models to count how many points each team scores in a quarter of NBA
and NFL games. Our major findings are twofold. Firstly, we find that
among all the models we employed, GPT-4 stands out in effectiveness, followed
by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind.
Specifically, we compare three different prompting techniques and a
divide-and-conquer approach, and find that the latter is the most effective.
Our divide-and-conquer approach breaks down play-by-play data into smaller,
more manageable segments, solves each piece individually, and then aggregates
them together. Besides the divide-and-conquer approach, we also explore the
Chain of Thought (CoT) strategy, which markedly improves outcomes for certain
models, notably GPT-4 and Claude-2.1, with their accuracy rates increasing
significantly. However, the CoT strategy has negligible or even detrimental
effects on the performance of other models like GPT-3.5 and Gemini-Pro.
Secondly, to our surprise, we observe that most models, including GPT-4,
struggle to accurately count the total scores for NBA quarters despite showing
strong performance in counting NFL quarter scores. This leads us to further
investigate the factors that impact the complexity of analytical reasoning
tasks with extensive experiments, through which we conclude that task
complexity depends on the length of context, the information density, and the
presence of related information. Our research provides valuable insights into
the complexity of analytical reasoning tasks and potential directions for
developing future large language models.
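
The divide-and-conquer idea described in the abstract (split the play-by-play data into segments, solve each segment, aggregate the results) can be illustrated with a minimal sketch. This is not the authors' implementation: the play format, the function names, and the regex-based `toy_segment_solver` (a stand-in for a per-segment language model call) are all assumptions made for illustration, and the sample plays are invented.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical play format, assumed for illustration only:
# "TEAM: description (+points)"; plays without "(+N)" score nothing.
PLAY_PATTERN = re.compile(r"^(?P<team>\w+): .*\(\+(?P<points>\d+)\)$")


def split_into_segments(plays: List[str], segment_size: int) -> List[List[str]]:
    """Divide: cut the play-by-play list into consecutive chunks."""
    return [plays[i:i + segment_size] for i in range(0, len(plays), segment_size)]


def toy_segment_solver(segment: List[str]) -> Dict[str, int]:
    """Stand-in for a per-segment model call: sums points per team via regex."""
    scores: Counter = Counter()
    for play in segment:
        match = PLAY_PATTERN.match(play)
        if match:  # non-scoring plays do not match and are skipped
            scores[match.group("team")] += int(match.group("points"))
    return dict(scores)


def divide_and_conquer_scores(
    plays: List[str],
    solve_segment: Callable[[List[str]], Dict[str, int]],
    segment_size: int = 3,
) -> Dict[str, int]:
    """Solve each segment independently, then aggregate per-team totals."""
    totals: Counter = Counter()
    for segment in split_into_segments(plays, segment_size):
        totals.update(solve_segment(segment))  # Counter.update adds counts
    return dict(totals)


# Invented sample quarter, not taken from any real game.
quarter_plays = [
    "HOME: makes 3-pt jump shot (+3)",
    "AWAY: misses layup",
    "AWAY: makes free throw (+1)",
    "HOME: makes dunk (+2)",
    "AWAY: makes 2-pt shot (+2)",
    "HOME: turnover",
    "AWAY: makes 3-pt jump shot (+3)",
]
result = divide_and_conquer_scores(quarter_plays, toy_segment_solver, segment_size=3)
print(result)
```

Passing the solver as a parameter keeps the split and aggregate steps independent of how each segment is scored, so a model-backed solver could be swapped in for the toy one without changing the rest of the sketch.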