DySched: Relieving Large-Scale Incast for Cloud-Native RDMA Applications.

Parallel and Distributed Processing with Applications (2023)

Abstract
Remote Direct Memory Access (RDMA), with its high throughput and low latency, has been widely deployed for applications such as cloud storage and distributed machine-learning training. However, in large-scale RDMA networks, congestion caused by incast often severely degrades both throughput and latency. Unlike traditional data centers, cloud-native infrastructure brings new challenges to resolving incast: the underlying network fabric and configurations are managed by the cloud vendors, and cloud-native applications running on virtualized or containerized platforms are agnostic to the network. We design DySched, a novel middleware that relieves large-scale incast for RDMA applications in a cloud-native environment without any dependency on the network fabric or its configuration (topology, flow control, congestion notification). DySched uses the inherent time cost of posting RDMA requests to determine the in-flight data allocation of each RDMA connection. It thus achieves ideal full-bandwidth utilization with no congestion at the receiver side. DySched has been deployed in two typical clusters in Alibaba Cloud. Results show that DySched improves throughput by 22% to 60% and reduces job completion time by 4% to 67% compared to state-of-the-art methods.
Keywords
RDMA, incast, traffic control, cloud-native