Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis

2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW)(2023)

引用 0|浏览17
暂无评分
摘要
The ubiquitous adoption of Large Language Generation Models (LLMs) in programming has highlighted the importance of distinguishing between human-written code and code generated by intelligent models. This paper specifically aims to distinguish ChatGPT-generated code from human-generated code. Our investigation reveals differences in programming style, technical level and readability between these two sources. Consequently, we develop a discriminative feature set for differentiation and evaluate its effectiveness through ablation experiments. In addition, we develop a dataset cleaning technique using temporal and spatial segmentation to mitigate dataset scarcity and ensure high quality, uncontaminated datasets. To further enrich the data resources, we apply "code transformation", "feature transformation" and "feature adaptation" techniques, generating a rich dataset of 100,000 lines of ChatGPT-generated code. The main contributions of our research include: proposing a discriminative feature set that yields high accuracy in distinguishing ChatGPT-generated code from human-authored code in binary classification tasks; devising methods for generating rich ChatGPT-generated code; and introducing a dataset cleansing strategy that extracts pristine, high-quality code datasets from open-source repositories, thereby achieving exceptional accuracy in code authorship attribution tasks.
更多
查看译文
关键词
ChatGPT, Code Differentiation, Dataset Cleansing, Machine Learning
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要