X-ray: Root-cause Diagnosis of Performance Anomalies in Production Software

semanticscholar(2011)

Abstract
Understanding and troubleshooting performance problems in complex software systems is notoriously challenging. This challenge is compounded for software in production for several reasons. To avoid slowing down production systems, analysis and troubleshooting must incur minimal overhead. Further, performance issues in production can be both rare and non-deterministic, making them hard to reproduce.
We argue that the most important reason why troubleshooting performance in production systems is challenging is that current tools solve only half the problem. Troubleshooting a performance anomaly is essentially the process of determining why certain events, such as high latency or resource usage, happened in a system. Unfortunately, most current analysis tools, such as profilers and logging [2, 5], determine only what events happened during a performance anomaly; they leave unanswered the more challenging question of why those events happened. Administrators and developers must manually infer the root cause of a performance issue from the observed events, based on their expertise and knowledge of the software.
This poster describes X-ray, a tool that uses performance summarization to determine what events occurred during a performance anomaly and also why the anomaly occurred. Performance summarization first attributes performance costs such as latency and I/O utilization to fine-grained events (individual instructions and system calls). Then, it uses dynamic information flow analysis to associate each such event with a set of probable root causes, such as configuration settings or specific data from input requests. The cost of each event is assigned to potential root causes, weighted by the relative probability that the particular root cause led to the execution of that event. Finally, the per-cause costs for all events in the program execution are summed. The end result is a list of root causes ordered by their performance costs.
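The cost-attribution step of performance summarization can be illustrated with a minimal sketch. It is not X-ray's implementation: the function name `summarize`, the event representation (a cost paired with per-cause weights), and the cause labels are all assumptions made for illustration, and the weights stand in for the relative probabilities that dynamic information flow analysis would supply.

```python
from collections import defaultdict


def summarize(events):
    """Rank hypothetical root causes by their attributed performance cost.

    `events` is an iterable of (cost, weights) pairs, where `weights` maps a
    root-cause label to the relative probability that the cause led to the
    event. This representation is an assumption, not X-ray's actual format.
    """
    totals = defaultdict(float)
    for cost, weights in events:
        total_weight = sum(weights.values())
        if total_weight == 0:
            continue  # no candidate cause was identified for this event
        for cause, weight in weights.items():
            # Split the event's cost in proportion to each cause's weight.
            totals[cause] += cost * weight / total_weight
    # Sum per cause, then order causes from most to least costly.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example: two events with latency costs in milliseconds.
events = [
    (10.0, {"config:max_clients": 1.0, "input:url": 1.0}),
    (4.0, {"input:url": 1.0}),
]
ranking = summarize(events)
```

Here the first event's cost is split evenly between two equally likely causes, and the second is charged entirely to the request URL, so the URL ranks first.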
X-ray can also compare the performance of two different requests. It performs differential performance analysis to determine why two requests differed in performance. For instance, differential performance analysis can be used to understand why two requests to a Web server took different amounts of time to complete. Differential performance analysis identifies branches where the execution paths of the two requests diverged. It assigns a performance cost to each path taken from the branch, then uses dynamic information flow analysis to determine why the two requests diverged at that point. It attributes the difference in performance costs between the two paths to the identified root causes according to their relative likelihood. The costs of all such divergences are summed. The output of X-ray is a set of reasons why the performance costs of two requests differ, along with a relative performance impact for each reason.
Some prior research systems, such as Spectroscope [3], also diagnose performance problems by comparing requests. Spectroscope, however, requires the requests to be the same, whereas X-ray can compare completely dissimilar requests and still identify the correct root cause.
Performance summarization is a high-overhead activity. To run this analysis for production software, X-ray uses deterministic replay to offload the heavyweight analysis from the production system. A deterministic replay system records the execution of the system so that an identical execution can later be replayed on demand. While many prior software systems provide this functionality [4], our use of deterministic replay to troubleshoot performance issues raised several new challenges. X-ray must split its functionality between the recorded and replayed executions; for example, timestamps must be captured during recording because the heavyweight analysis substantially perturbs timing.
Further, because of the split analysis, the fidelity of the replay must be strict enough to guarantee that the two executions are identical at the granularity observed by X-ray. However, because the replayed execution includes analysis code that the recorded execution does not, the fidelity of the replay must also be loose enough to let the replayed execution diverge as far as needed to run the analysis. X-ray achieves these goals through careful co-design of the deterministic replay and analysis systems.
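The differential performance analysis described above can be sketched in the same spirit. This is an illustrative sketch under stated assumptions, not X-ray's implementation: the function name `differential_summary`, the divergence tuples, and the cause labels are hypothetical, and the weights again stand in for the relative likelihoods produced by dynamic information flow analysis.

```python
from collections import defaultdict


def differential_summary(divergences):
    """Attribute the cost difference between two requests to root causes.

    `divergences` is an iterable of (cost_a, cost_b, weights) tuples, one per
    branch where the two requests took different paths. `cost_a` and `cost_b`
    are the costs of the paths taken by each request, and `weights` maps a
    root-cause label to its relative likelihood. All of this is assumed.
    """
    totals = defaultdict(float)
    for cost_a, cost_b, weights in divergences:
        delta = cost_a - cost_b  # positive: request A was slower here
        total_weight = sum(weights.values())
        if total_weight == 0:
            continue  # no candidate cause was identified for this branch
        for cause, weight in weights.items():
            totals[cause] += delta * weight / total_weight
    # Order causes by the magnitude of their impact on the difference.
    return sorted(totals.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical example: two divergence points between requests A and B.
divergences = [
    (8.0, 2.0, {"input:cookie": 3.0, "config:cache": 1.0}),
    (1.0, 3.0, {"config:cache": 1.0}),
]
impact = differential_summary(divergences)
```

A negative total means the cause made the second request more expensive overall; sorting by magnitude keeps such causes visible in the ranking.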