SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification

Weihan Ou,Steven H. H. Ding,Yuan Tian,Leo Song

IEEE Transactions on Software Engineering（2023）

引用 5|浏览6

暂无评分

摘要

In recent years, the number of anonymous script-based fileless malware attacks and software copyright disputes has increased rapidly. In the literature, automated Code Authorship Analysis (CAA) techniques have been proposed to reduce the manual effort in identifying those attacks and issues. Most CAA techniques aim to solve the task of Authorship Attribution (AA), i.e., identifying the actual author of a source code fragment from a given set of candidate authors. However, in many real-world scenarios, investigators do not have a predefined set of authors containing the actual author at the time of investigation, i.e., contradicting AA's assumption. Additionally, existing AA techniques ignore the influence of code functionality when identifying the authorship, which leads to biased matching simply based on code functionality. Different from AA, the task of (extreme) Authorship Verification (AV) is to decide if two texts were written by the same person or not. AV techniques do not need a predefined author set and thus could be applied in more code authorship-related applications than AA. To our knowledge, there is no previous work attempting to solve the AV problem for the source code. To fill the gap, we propose a novel adversarial neural network, namely SCS-Gan, that can learn a stylometric representation of code for automated AV. With the multi-head attention mechanism, SCS-Gan focuses on the code parts that are most informative regarding personal styles and generates functionality-agnostic stylometric representations through adversarial training. We benchmark SCS-Gan and two state-of-the-art code representation models on four out-of-sample datasets collected from a real-world programming competition. Our experiment results show that SCS-Gan outperforms the baselines on all four out-of-sample datasets.

查看译文

关键词

Codes,Task analysis,Training,Encoding,Feature extraction,Malware,Python,Cyber threat intelligence,representation learning,adversarial learning,authorship analysis,code authorship verification

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要