A Test for Evaluating Performance in Human-AI Systems

Research Square (2023)

Abstract
Many important uses of AI involve augmenting humans, not replacing them. But there is not yet a widely used and broadly comparable test for evaluating the performance of these human-AI systems relative to humans alone, AI alone, or other baselines. Here we describe such a test and demonstrate its use in three ways. First, in an analysis of 79 recently published results, we find that, surprisingly, the median performance improvement ratio corresponds to no improvement at all, and the maximum improvement is only 36%. Second, we experimentally find a 27% performance improvement when 100 human programmers develop software using GPT-3, a modern, generative AI system. Finally, we find that 50 human non-programmers using GPT-3 perform the task about as well as, and less expensively than, the human programmers. Since neither the non-programmers nor the computer could perform the task alone, this illustrates a strong form of human-AI synergy.
Keywords
performance, evaluating, test