Extraction of Phrase-based Concepts in Vulnerability Descriptions through Unsupervised Labeling

Sofonias Yitagesu,Zhenchang Xing,Xiaowang Zhang,Zhiyong Feng,Xiaohong Li,Linyi Han

ACM Transactions on Software Engineering and Methodology（2023）

引用 1|浏览28

暂无评分

摘要

Software vulnerabilities, once disclosed, can be documented in vulnerability databases, which have great potential to advance vulnerability analysis and security research. People describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for the automatic analysis of vulnerability knowledge embedded in text. Automatic extraction of key vulnerability aspects is highly desirable but demands significant effort to manually label data for model training. In this paper, we propose unsupervised methods to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on six types of phrase-based vulnerability concepts (vulnerability type, vulnerable component, root cause, attacker type, impact, attack vector) as they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product, and version). Our approach is based on a key observation that the same-type of phrases, no matter how they differ in sentence structures and phrase expressions, usually share syntactically similar paths in the sentence parsing trees. Specifically, we present a source-target neural architecture that learns the Part-of-Speech (POS) tagging to identify a token’s functional role within TVDs, where the source neural model is trained to capture common features found in the TVD corpus, and the target model is trained to identify linguistically malformed words specific to the security domain. Our evaluation confirms that the proposed tagger outperforms (4.45%–5.98%) the taggers designed on natural language notions and identifies a broad set of TVDs and natural language contents. Then, based on the key observations, we propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode such syntactic similarities. To address the discrete nature of our paths, we enhance the traditional Variational Auto-encoder (VAE) with Gumble-Max trick for categorical data distribution and thus create a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we further apply unsupervised clustering techniques to generate clusters of the same-type of concepts. Our evaluation confirms the effectiveness of our CaVAE, which achieves a small (85.85) log-likelihood for encoding path representations and the accuracy (83%–89%) of vulnerability concepts in the resulting clusters. The resulting clusters accurately label six types of vulnerability concepts from a TVD corpus in an unsupervised way. Furthermore, these labeled vulnerability concepts can be mapped back to the corresponding phrases in the original TVDs, which produce labels of six types of vulnerability concepts. The resulting labeled TVDs can be used to train concept extraction models for other TVD corpora. In this work, we present two concept extraction methods (concept classification and sequence labeling model) to demonstrate the utility of the unsupervisedly labeled concepts. Our study shows that models trained with our unsupervisedly labeled vulnerability concepts outperform (3.9%–5.14%) those trained with the two manually labeled TVD datasets from previous work due to the consistent boundary and typing by our unsupervised labeling method.

查看译文

关键词

Textual vulnerability descriptions,phrase-based vulnerability concepts,unsupervised representation learning,clustering and concept labeling,supervised concept extraction

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要