Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting
CoRR (2024)
Abstract
Large Language Models (LLMs) have exhibited remarkable proficiency in
generating code. However, the misuse of LLM-generated (synthetic) code has
raised concerns in both educational and industrial settings, highlighting
the need for synthetic code detectors. Existing
methods for detecting LLM-generated content are primarily tailored for general
text and often struggle with code because of the distinct grammatical
structure of programming languages and the large number of "low-entropy" tokens.
Motivated by this, we propose a novel zero-shot synthetic code detector based on
the similarity between the code and its rewritten variants. Our method relies
on the intuition that an LLM-rewritten version tends to differ less from the
original code when the original code is itself synthetic. We use
self-supervised contrastive learning to train a code similarity model and
assess our approach on two synthetic code detection benchmarks. Our results
demonstrate a notable enhancement over existing synthetic content detectors
designed for general texts, with an improvement of 20.5
and 29.1
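
The rewriting intuition described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: `rewrite` is a hypothetical callable standing in for an LLM rewriting call, `token_similarity` is a simple `difflib`-based stand-in for the contrastively trained code similarity model, and `num_rewrites` and `threshold` are illustrative values only.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable


def token_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] over whitespace-separated tokens.

    Assumption: this replaces the trained code similarity model
    mentioned in the abstract; it is NOT the paper's model.
    """
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def detection_score(
    code: str,
    rewrite: Callable[[str], str],
    similarity: Callable[[str, str], float] = token_similarity,
    num_rewrites: int = 4,  # hypothetical default
) -> float:
    """Average similarity between `code` and several rewrites of it.

    `rewrite` is a hypothetical hook for an LLM call. A higher score
    means the rewrites stayed closer to the original, which the
    abstract associates with synthetic code.
    """
    if num_rewrites < 1:
        raise ValueError("num_rewrites must be at least 1")
    return mean(similarity(code, rewrite(code)) for _ in range(num_rewrites))


def is_synthetic(score: float, threshold: float = 0.8) -> bool:
    """Threshold the score; 0.8 is an illustrative value, not a reported one."""
    return score >= threshold
```

In practice `rewrite` would wrap a call to an LLM and the threshold would be chosen on held-out data. The sketch only shows how the scores from several rewrites might be aggregated into one decision.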