Limits of Detecting Text Generated by Large-Scale Language Models

2020 Information Theory and Applications Workshop (ITA)

Abstract
Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a standard measure of language generation performance. Under the assumption that human language is stationary and ergodic, the formulation is extended from considering specific language models to considering maximum likelihood language models, among the class of k-order Markov approximations; error probabilities are characterized. Some discussion of incorporating semantic side information is also given.
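
To make the hypothesis-testing framing concrete, here is a minimal sketch in standard information-theoretic notation; the symbols P, Q, H, D, and PPL below are ours for illustration (under an i.i.d. simplification; the paper itself works with stationary ergodic processes) and are not quoted from the paper. Let P be the distribution of genuine human text and Q the language model's distribution, and test

$$H_0 : X^n \sim P \ (\text{genuine}) \qquad \text{vs.} \qquad H_1 : X^n \sim Q \ (\text{generated}).$$

By the Chernoff–Stein lemma, with the false-alarm probability (flagging genuine text as generated) held fixed, the best achievable miss probability $\beta_n$ decays with exponent

$$\lim_{n\to\infty} -\frac{1}{n}\log \beta_n = D(P \,\|\, Q).$$

Since the per-token cross-entropy decomposes as $H(P,Q) = H(P) + D(P\,\|\,Q)$ and perplexity (with base-2 logarithms) is $\mathrm{PPL}(Q) = 2^{H(P,Q)}$, it follows that

$$D(P \,\|\, Q) = \log_2 \mathrm{PPL}(Q) - H(P),$$

so as a model's perplexity approaches the entropy rate of human text, the achievable error exponent of any detector shrinks toward zero — consistent with the abstract's claim that error exponents are bounded in terms of perplexity.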
Keywords
large-scale language model output detection,language generation performance,human language,maximum likelihood language models,text detection,k-order Markov approximations,error probabilities,semantic side information