Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
NAIC (2024)
Abstract
To render each generated token for users in real time, the Large Language
Model (LLM) server generates tokens one by one and streams each token (or
group of a few tokens) through the network to the user right after generation,
which we refer to as LLM token streaming. However, under unstable network
conditions, the LLM token streaming experience can suffer greatly from stalls,
since a single packet loss can block the rendering of later tokens even if the
packets containing them arrive on time. With a measurement study, we show that
current applications suffer from increased stalls under unstable networks.
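
To make this failure mode concrete, the toy snippet below (our own
illustration, not taken from the paper) mimics the strictly ordered delivery
that a TCP-style stream enforces: packets carrying later tokens arrive on
time, yet nothing past the lost packet can be shown until it is retransmitted.

# Toy illustration of head-of-line blocking under ordered delivery.
# Packet 2 was lost; packets 3 and 4 arrived on time.
received = {1: "Hello", 3: ",", 4: " world"}
next_seq, rendered = 1, []
while next_seq in received:      # tokens may only render in sequence order
    rendered.append(received[next_seq])
    next_seq += 1
print(rendered)                  # -> ['Hello']: tokens 3 and 4 sit stalled
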
For this emerging token streaming problem in LLM chatbots, which differs from
previous multimedia and text applications, we propose a novel transmission
scheme, called Eloquent, which puts newly generated tokens as well as all
currently unacknowledged tokens into the next outgoing packet. This ensures
that each packet contains some new tokens and, at the same time, can be
rendered independently when received, avoiding the aforementioned stalls
caused by missing packets.
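
As a minimal sketch of this packetization idea, the Python below models the
sender side; the names (Packet, EloquentSender, on_token, on_ack) and the
token-level sequence numbering are our own simplifying assumptions, not the
paper's implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Packet:
    first_seq: int        # sequence number of the first token carried
    tokens: List[str]     # that token plus every newer token

@dataclass
class EloquentSender:
    acked_up_to: int = -1                               # highest ACKed token
    generated: List[str] = field(default_factory=list)  # all tokens so far

    def on_token(self, token: str) -> Packet:
        # Each outgoing packet carries the new token together with every
        # still-unacknowledged older token, so it can be rendered on its
        # own even if earlier packets were lost.
        self.generated.append(token)
        first = self.acked_up_to + 1
        return Packet(first_seq=first, tokens=self.generated[first:])

    def on_ack(self, seq: int) -> None:
        # Receiver confirmed every token up to and including seq.
        self.acked_up_to = max(self.acked_up_to, seq)

sender = EloquentSender()
sender.on_token("Streaming")      # packet carries ["Streaming"]
sender.on_ack(0)                  # "Streaming" acknowledged
sender.on_token("tokens")         # carries ["tokens"] -- suppose it is lost
p = sender.on_token("one")        # carries ["tokens", "one"]: self-contained
print(p.first_seq, p.tokens)      # -> 1 ['tokens', 'one']

Because every packet begins at the oldest unacknowledged token, any single
packet that does arrive is self-contained, so the loss of an earlier packet
never blocks rendering.
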
Through simulation under various networks, we show that Eloquent reduces the
stall ratio (the proportion of streaming time spent waiting for tokens to
render) by 71.0% compared to the retransmission method commonly used by real
chatbot applications and by 31.6%
compared to the baseline packet duplication scheme. By tailoring Eloquent to
fit the token-by-token generation of LLMs, we enable chatbots to respond like
an eloquent speaker, letting users better enjoy pervasive AI.
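
The abstract glosses the stall ratio only parenthetically; one plausible
formalization (our reading, not a definition given by the authors) is

    stall ratio = T_stall / T_total

where T_stall is the total time rendering is blocked waiting for the next
token and T_total is the duration of the streaming session.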