Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, November 2017

Abstract
As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information regarding machine health and root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. Then, we explore the abilities of this learned model to identify anomalous behavior by evaluating its ability to catch inserted and missing log messages. We evaluate our model and its performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while our model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.
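
To make the pipeline in the abstract concrete, here is a minimal sketch of the general idea: fit a Markov chain to sequences of log message types, sample synthetic sequences from it, and score a sequence by how probable its transitions are. The abstract does not specify the chain's order, how log messages are mapped to states, or how anomalies are scored, so everything below is an assumption: a first-order chain, states given as already-extracted message labels, add-alpha smoothing, and average log-likelihood as the score. All function and variable names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers for each log sequence


def fit_markov_chain(sequences, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing."""
    counts = defaultdict(Counter)
    states = {END}
    for seq in sequences:
        path = [START, *seq, END]
        states.update(seq)
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    targets = sorted(states)
    model = {}
    for prev in [START, *sorted(states - {END})]:
        total = sum(counts[prev].values()) + alpha * len(targets)
        model[prev] = {t: (counts[prev][t] + alpha) / total for t in targets}
    return model


def generate(model, rng, max_len=50):
    """Sample one synthetic sequence of message labels from the chain."""
    state, out = START, []
    while len(out) < max_len:
        targets = list(model[state])
        weights = [model[state][t] for t in targets]
        state = rng.choices(targets, weights=weights)[0]
        if state == END:
            break
        out.append(state)
    return out


def avg_log_likelihood(model, seq, floor=1e-9):
    """Mean log-probability per transition; lower values look more anomalous."""
    path = [START, *seq, END]
    total = 0.0
    for prev, nxt in zip(path, path[1:]):
        # Unseen states fall back to a small floor probability (an assumption).
        total += math.log(model.get(prev, {}).get(nxt, floor))
    return total / (len(path) - 1)


if __name__ == "__main__":
    # Toy training data standing in for "normal" logs (invented for illustration).
    normal_logs = [["boot", "mount", "login", "logout"]] * 20
    model = fit_markov_chain(normal_logs)
    print(generate(model, random.Random(0)))
    print(avg_log_likelihood(model, ["boot", "mount", "login", "logout"]))
    print(avg_log_likelihood(model, ["boot", "login", "logout"]))  # missing message
```

Under these assumptions, a sequence with an inserted or missing message passes through low-probability transitions and therefore receives a lower score than a typical sequence. Turning that score into a detector would still require choosing a threshold, which the abstract does not describe.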