Autonomous, failure-resilient orchestration of distributed discrete event simulations

CAC '13: Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference (2013)

Abstract
Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.
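The idea of adjusting a fault tolerance policy according to predicted state changes can be illustrated with a minimal sketch. The abstract does not describe the agent's actual predictor or policy, so everything here is an assumption made for illustration: the class `AdaptiveCheckpointPolicy`, its `alpha` and `budget` parameters, the helper `checkpoint_steps`, and the use of an exponential moving average as a stand-in predictor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveCheckpointPolicy:
    """Hypothetical policy: checkpoint more often when state is predicted to change fast."""

    alpha: float = 0.5             # smoothing factor of the stand-in predictor (assumed)
    budget: float = 10.0           # predicted change tolerated between checkpoints (assumed)
    predicted_change: float = 0.0  # current prediction of per-step state change
    accumulated: float = 0.0       # predicted change since the last checkpoint

    def observe(self, state_change: float) -> bool:
        """Record one step's state change; return True if a checkpoint is due."""
        # Exponential moving average used as a simple stand-in predictor.
        self.predicted_change = (
            self.alpha * state_change + (1 - self.alpha) * self.predicted_change
        )
        self.accumulated += self.predicted_change
        if self.accumulated >= self.budget:
            self.accumulated = 0.0  # a checkpoint resets the exposure
            return True
        return False


def checkpoint_steps(changes, policy=None):
    """Return the step indices at which the policy requests a checkpoint."""
    policy = policy or AdaptiveCheckpointPolicy()
    return [i for i, change in enumerate(changes) if policy.observe(change)]
```

Under these assumptions, a simulation phase with large state changes triggers checkpoints at short intervals, while a quiet phase triggers them rarely, which is one way to limit checkpoint overhead without leaving large amounts of state unprotected.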
Keywords
production discrete event simulation, accurate model, discrete event simulation, relevant event, fault-tolerant execution, failure-resilient orchestration, fault tolerance functionality, discrete event simulations model, fault tolerance policy, fault tolerance scheme, fault tolerance framework, prediction, neural networks, fault tolerance