Learned Best-Effort LLM Serving
CoRR (2024)
Abstract
Many applications must provide low-latency LLM service to users or risk
unacceptable user experience. However, over-provisioning resources to serve
fluctuating request patterns is often prohibitively expensive. In this work, we
present a best-effort serving system that employs deep reinforcement learning
to adjust service quality based on the task distribution and system load. Our
best-effort system can maintain availability with over 10x higher client
request rates, serves above 96
above 98
unpredictable workloads. Our learned router is robust to shifts in both the
arrival and task distribution. Compared to static serving, learned best-effort
serving allows for cost-efficient serving through increased hardware utility.
Additionally, we argue that learned best-effort LLM serving is applicable in
a wide variety of settings and gives application developers great flexibility
to meet their specific needs.