DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang

Annual Meeting of the Association for Computational Linguistics (2024)

Abstract
In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% of computations.
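The two strategies in the abstract are architectural: segment the N experts into mN finer ones and route each token to the top m*K of them, while keeping K_s shared experts that every token always passes through. The PyTorch sketch below only illustrates this routing pattern under assumed shapes and hyperparameters; the class names, sizes, and the simple softmax top-k router are ours for illustration, not the paper's implementation, and load balancing and inference efficiency are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One small feed-forward expert; d_ff is divided by the segmentation factor m."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class FineGrainedSharedMoE(nn.Module):
    """Hypothetical sketch of the two strategies: m*N fine-grained routed experts
    (each token activates the top m*K of them) plus K_s shared experts that are
    always applied without routing."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, m=4, top_k=2, n_shared=2):
        super().__init__()
        d_ff_fine = d_ff // m  # finer experts keep per-token compute roughly constant
        self.routed = nn.ModuleList(Expert(d_model, d_ff_fine) for _ in range(m * n_experts))
        self.shared = nn.ModuleList(Expert(d_model, d_ff_fine) for _ in range(n_shared))
        self.router = nn.Linear(d_model, m * n_experts, bias=False)
        self.n_active = m * top_k  # activate m*K out of m*N routed experts

    def forward(self, x):  # x: (num_tokens, d_model)
        out = sum(e(x) for e in self.shared)           # shared experts, no gating
        probs = F.softmax(self.router(x), dim=-1)      # token-to-expert affinities
        gate, idx = probs.topk(self.n_active, dim=-1)  # per-token top-(m*K) choice
        for k in range(self.n_active):
            sel, w = idx[:, k], gate[:, k:k + 1]
            for e_id in sel.unique().tolist():         # run each chosen expert once
                mask = sel == e_id
                out[mask] = out[mask] + w[mask] * self.routed[e_id](x[mask])
        return out


if __name__ == "__main__":
    tokens = torch.randn(5, 64)
    print(FineGrainedSharedMoE()(tokens).shape)  # torch.Size([5, 64])
```

The point of the sketch is the combinatorics: with m*N fine-grained experts and m*K activations per token there are far more possible expert combinations than with top-K over N coarse experts, while the always-on shared experts absorb common knowledge so the routed experts can specialize.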