OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory

ICDE (2023)

Abstract
In this paper, we present OpenEmbedding, a distributed parameter server system for deep learning recommendation model (DLRM) workloads. To support the rapid growth in the number of features and in model size (terabytes are common) of DLRM workloads, OpenEmbedding takes advantage of emerging persistent memory (PMem) to address scalability and reliability issues in training DLRMs. Compared to DRAM, PMem offers much lower per-GB cost, higher density, and non-volatility, albeit with somewhat lower access performance. OpenEmbedding uses DRAM as a cache and PMem as storage for the sparse features, and develops a simple but effective pipeline processing approach to optimize the access latency of sparse features in PMem. For reliability, we develop a lightweight synchronous checkpointing scheme that is co-designed with the pipelined cache to reduce the run-time overhead of checkpointing. Our evaluations on a real-world industry workload consisting of billions of parameters demonstrate 1) the effectiveness of our PMem-aware optimizations, 2) a checkpointing mechanism with near-zero run-time overhead on training performance, and 3) fast recovery with up to 3.97× speedup compared to the state of the art. OpenEmbedding has been deployed in hundreds of industry scenarios within 4Paradigm, and is open-sourced.
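The DRAM-as-cache, PMem-as-storage layout for sparse-feature embeddings can be sketched as follows. This is a minimal illustration only, not the OpenEmbedding API: the class name, the plain-dict stand-in for the PMem table, and the LRU policy are all assumptions for exposition.

```python
from collections import OrderedDict

class CachedEmbeddingStore:
    """Hypothetical sketch: hot sparse-feature embeddings are served from a
    small LRU cache (standing in for DRAM), and misses fall through to a
    larger backing table (standing in for the PMem storage tier)."""

    def __init__(self, backing_store, cache_capacity):
        self.backing = backing_store   # "PMem": full embedding table
        self.cache = OrderedDict()     # "DRAM": hot subset, LRU-ordered
        self.capacity = cache_capacity

    def lookup(self, feature_id):
        if feature_id in self.cache:
            self.cache.move_to_end(feature_id)  # mark as most recently used
            return self.cache[feature_id]
        vec = self.backing[feature_id]          # slow-path access to storage
        self.cache[feature_id] = vec
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least-recently-used
        return vec

# Usage: a table of 4 embeddings with a cache that holds 2.
store = CachedEmbeddingStore({i: [float(i)] * 3 for i in range(4)},
                             cache_capacity=2)
store.lookup(0); store.lookup(1); store.lookup(2)  # id 0 gets evicted
print(list(store.cache))  # → [1, 2]
```

In the real system the slow path is where the pipelined processing pays off: lookups for upcoming batches can be issued ahead of time so that PMem access latency overlaps with computation rather than stalling it.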
Keywords
machine learning system,recommendation model,parameter server,persistent memory