DETECTING MALICIOUS PDF DOCUMENTS USING SEMI-SUPERVISED MACHINE LEARNING

Jianguo Jiang, Nan Song, Min Yu, Kam-Pui Chow, Gang Li, Chao Liu, Weiqing Huang

ADVANCES IN DIGITAL FORENSICS XVII (2021)

Abstract

Portable Document Format (PDF) documents are often used as carriers of malicious code that launches attacks or steals personal information. Traditional manual and supervised-learning-based detection methods rely heavily on labeled samples of malicious documents, which is problematic because very few labeled malicious samples are available in real-world scenarios. This chapter presents a semi-supervised machine learning method for detecting malicious PDF documents. The method extracts structural features as well as statistical features derived from entropy sequences using the wavelet energy spectrum. A random sub-sampling strategy is employed to train multiple sub-classifiers. Each sub-classifier is independent, which enhances generalization capability during detection. The semi-supervised learning method enables both labeled and unlabeled samples to be used to classify malicious and benign PDF documents. Experimental results demonstrate that the method yields an accuracy of 94% despite using training data with just 11% labeled malicious samples.
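The abstract does not specify how the entropy sequences or the wavelet energy spectrum are computed. As a rough illustration only, and not the authors' implementation, the sketch below computes a windowed Shannon entropy sequence over raw bytes and then the per-level energy of a Haar wavelet decomposition. The 256-byte window, the choice of the Haar wavelet, the zero-padding, and all function names are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a byte block, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_sequence(data: bytes, window: int = 256) -> List[float]:
    """Entropy of consecutive non-overlapping windows.

    The window size is an assumption; the abstract does not state one.
    """
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data), window)]


def haar_energy_spectrum(signal: List[float]) -> List[float]:
    """Energy of the Haar detail coefficients at each decomposition level.

    The signal is zero-padded to a power-of-two length (an assumption).
    Index 0 of the result is the finest scale.
    """
    n = 1
    while n < len(signal):
        n *= 2
    approx = list(signal) + [0.0] * (n - len(signal))
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Orthonormal Haar step: scaled differences and scaled sums.
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies
```

Under these assumptions, a feature vector for one document could be obtained by calling `haar_energy_spectrum(entropy_sequence(raw_bytes))` on the file contents and truncating or padding the result to a fixed number of levels.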

Keywords

Malicious PDF documents, machine learning, semi-supervised learning
