Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities
arXiv (2024)
Abstract
With the rise of Large Language Models (LLMs), AI assistants' ability to
utilize tools, especially through API calls, has advanced notably. This
progress has necessitated more accurate evaluation methods. Many existing
studies adopt static evaluation, assessing an AI assistant's API calls against
pre-defined dialogue histories. However, such an evaluation method can be
misleading, as an assistant that succeeds on a fixed history might still fail
to generate correct API calls from the preceding human interaction in real
use. Instead of the resource-intensive
method of direct human-machine interactions, we propose Automated Dynamic
Evaluation (AutoDE) to assess an assistant's API call capability without human
involvement. In our framework, we endeavor to closely mirror genuine human
conversation patterns in human-machine interactions, using a LLM-based user
agent, equipped with a user script to ensure human alignment. Experimental
results highlight that AutoDE uncovers errors overlooked by static evaluations,
aligning more closely with human assessment. Testing four AI assistants on
our crafted benchmark, our method mirrored human evaluation with a correlation
of 0.99, marking an 8
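The dynamic setup the abstract describes can be illustrated with a minimal sketch, not the paper's implementation: a scripted user agent reveals its intent turn by turn, the assistant under test either asks back or emits an API call, and the final call is checked against a reference. All names here (`run_dynamic_eval`, `toy_assistant`, the `CALL` string convention) are hypothetical stand-ins for the LLM-backed components AutoDE actually uses.

```python
def run_dynamic_eval(user_script, assistant, expected_call, max_turns=6):
    """Drive a multi-turn dialogue between a scripted user agent and an
    assistant, then verify the assistant's final API call (illustrative only)."""
    history = []
    for _ in range(max_turns):
        # The user agent reveals the next not-yet-mentioned piece of its
        # scripted intent, mimicking how a human states requirements gradually.
        said = " ".join(history)
        pending = [slot for slot in user_script if slot not in said]
        if not pending:
            break
        history.append(pending[0])
        reply = assistant(history)  # assistant may ask back or emit a call
        history.append(reply)
        if reply.startswith("CALL "):
            break
    # Compare the last emitted API call (if any) with the reference call.
    final = next((t for t in reversed(history) if t.startswith("CALL ")), None)
    return final == expected_call


def toy_assistant(history):
    """Toy assistant: emits the API call only once both slots were mentioned;
    a static evaluation on a pre-filled history would never exercise the
    clarification turn below."""
    text = " ".join(history)
    if "city=Paris" in text and "date=2024-05-01" in text:
        return "CALL get_weather(city=Paris, date=2024-05-01)"
    return "Which date do you mean?"


ok = run_dynamic_eval(
    user_script=["I need the weather, city=Paris", "date=2024-05-01"],
    assistant=toy_assistant,
    expected_call="CALL get_weather(city=Paris, date=2024-05-01)",
)
```

The point of the sketch is the loop itself: because the user agent feeds information incrementally, an assistant that only works when the full history is handed to it up front (the static setting) is caught failing here.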