MaGnn: Binary-Source Code Matching by Modality-Sharing Graph Convolution for Binary Provenance Analysis.

COMPSAC(2023)

Abstract
The number and variety of binaries running on electrical devices, public clouds, and on-premise infrastructure have been increasing rapidly. Recent successful supply chain attacks indicate that even binaries known to come from trusted developers can still contain malicious functionality and copy-and-pasted vulnerabilities that pose security risks to operational systems and end users. Code provenance analysis helps mitigate this problem by revealing information about the origin of a binary sample, such as its author or its software bill of materials. Since in most cases source symbol information is removed during compilation, matching a given binary code sample to its corresponding source code could improve the accuracy and efficiency of provenance analysis. Existing binary-source code matching methods focus on comparing manually selected code literals (e.g., the number of if/else statements). However, these methods generalize poorly and require significant manual effort. In contrast to previous methods, we propose a machine learning-based binary-source code matching system, MaGnn, which measures the consistency of an input binary-source code pair by automatically extracting high-dimensional feature representations of the input and calculating the functionality similarity. With a Siamese architecture that shares a unified encoder across the two modalities, MaGnn is able to calculate the similarity of the input binary-source code pair from the automatically extracted functionality representations. With a graph convolutional neural network as the representation encoder, MaGnn is able to learn and encode the functionality information of the input pairs from their graph features into high-dimensional representation vectors.
We benchmark MaGnn against a state-of-the-art binary-source code matching method and two machine learning models on six out-of-sample datasets collected from five real-world libraries. Our experimental results show that MaGnn outperforms the baselines on most out-of-sample datasets.
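The Siamese design described in the abstract, where one graph-convolution encoder is shared by both modalities and the two resulting vectors are compared, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the number of layers, the pooling step, the similarity measure, or how the graphs are extracted, so the two-layer encoder, mean pooling, cosine similarity, random untrained weights, and all function names below are assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, a normalization commonly used in GCNs."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 because of the added self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_encode(adj, feats, weights):
    """Apply one graph-convolution layer (with ReLU) per weight matrix,
    then mean-pool the node vectors into a single graph-level vector.
    Mean pooling is an assumption, not taken from the paper."""
    a_norm = normalize_adjacency(adj)
    h = feats
    for w in weights:
        h = matmul(matmul(a_norm, h), w)
        h = [[max(0.0, v) for v in row] for row in h]
    return [sum(col) / len(h) for col in zip(*h)]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def siamese_similarity(graph_a, graph_b, weights):
    """Encode both graphs with the SAME weights (the Siamese part) and compare."""
    return cosine_similarity(gcn_encode(*graph_a, weights), gcn_encode(*graph_b, weights))


def init_weights(dims, seed=0):
    """Random non-negative weights standing in for trained parameters (hypothetical)."""
    rng = random.Random(seed)
    return [[[rng.uniform(0.0, 1.0) for _ in range(d_out)] for _ in range(d_in)]
            for d_in, d_out in zip(dims, dims[1:])]


# Toy graphs as (adjacency matrix, node feature matrix); both are made up.
binary_graph = ([[0, 1], [1, 0]], [[1.0, 0.0], [0.0, 1.0]])
source_graph = ([[0, 1, 0], [1, 0, 1], [0, 1, 0]], [[1.0, 2.0], [0.5, 0.1], [0.3, 0.3]])
shared_weights = init_weights([2, 4, 3])
print(siamese_similarity(binary_graph, source_graph, shared_weights))
```

Because both inputs pass through the same weights, graphs of different sizes from the two modalities land in one vector space, which is what makes a direct similarity score meaningful. In a trained system the weights would be learned from matching and non-matching binary-source pairs rather than drawn at random.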
Keywords
binary provenance, representation learning, binary source code matching, graph learning