Improving Congestion Control through Fine-Grain Monitoring of InfiniBand Networks

2022 IEEE Symposium on High-Performance Interconnects (HOTI) (2022)

Abstract
Congestion is a serious threat to the performance of the interconnection networks of High-Performance Computing and Data-Center systems. Hence, the specifications of the main interconnect technologies, such as InfiniBand, define mechanisms to deal with congestion and its effects. However, these standard mechanisms may not detect or track the actual status of network congestion accurately, as congestion dynamics can be very complex and varied. Moreover, finding an optimal configuration of the parameters that drive the different functionalities of congestion-control mechanisms is often difficult, as a configuration that suits some traffic scenarios may not suit others. In this paper, we propose combining an existing lightweight platform monitoring tool (LIMITLESS) with the InfiniBand control software (OpenSM), so that the metrics on network communication volumes provided by the former give the latter a more precise picture of congestion status and allow it to react more efficiently in these situations. The main contributions of this paper are the methodology to link the monitor and OpenSM, and modifications to the standard InfiniBand congestion-control mechanism so that its reaction is modulated by the enhanced knowledge about congestion provided by the monitor. These improvements are ready to be integrated into any InfiniBand-based system. According to the results of our experiments (performed on a real InfiniBand-based cluster running a widely used benchmark), the proposed approach significantly reduces the number of wrong congestion detections, and therefore the number of times that the congestion-control mechanisms react unnecessarily, improving system performance by up to 74%. The overhead of the monitoring tool is 0.1% in our experiments, with data collected every 200 ms.
Keywords
Interconnection networks, cluster, congestion control, traffic monitoring, InfiniBand