Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
CoRR(2024)
摘要
Retrieval pipelines-an integral component of many machine learning
systems-perform poorly in domains where documents are long (e.g., 10K tokens or
more) and where identifying the relevant document requires synthesizing
information across the entire text. Developing long-context retrieval encoders
suitable for these domains raises three challenges: (1) how to evaluate
long-context retrieval performance, (2) how to pretrain a base language model
to represent both short contexts (corresponding to queries) and long contexts
(corresponding to documents), and (3) how to fine-tune this model for retrieval
under the batch size limitations imposed by GPU memory constraints. To address
these challenges, we first introduce LoCoV1, a novel 12 task benchmark
constructed to measure long-context retrieval where chunking is not possible or
not effective. We next present the M2-BERT retrieval encoder, an 80M parameter
state-space encoder model built from the Monarch Mixer architecture, capable of
scaling to documents up to 32K tokens long. We describe a pretraining data
mixture which allows this encoder to process both short and long context
sequences, and a finetuning approach that adapts this base model to retrieval
with only single-sample batches. Finally, we validate the M2-BERT retrieval
encoder on LoCoV1, finding that it outperforms competitive baselines by up to
23.3 points, despite containing 5-90x fewer parameters.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要