CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks
CoRR (2024)
Abstract
Knowledge-intensive language tasks (KILTs) typically require retrieving
relevant documents from trustworthy corpora, e.g., Wikipedia, to produce
specific answers. Recently, a pre-trained generative retrieval model for
KILTs, named CorpusBrain, was proposed and reached new state-of-the-art
retrieval performance. However, most existing research on KILTs, including
CorpusBrain, has predominantly focused on a static document collection,
overlooking the dynamic nature of real-world scenarios, where new documents are
continuously being incorporated into the source corpus. To address this gap, it
is crucial to explore the capability of retrieval models to effectively handle
the dynamic retrieval scenario inherent in KILTs.
In this work, we first introduce the continual document learning (CDL) task
for KILTs and build a novel benchmark dataset named KILT++ based on the
original KILT dataset for evaluation. Then, we conduct a comprehensive study
of pre-trained CorpusBrain on KILT++. In contrast to its promising results
in the static scenario, CorpusBrain is prone to catastrophic forgetting in
the dynamic scenario, which hampers its retrieval performance. To alleviate
this issue, we propose CorpusBrain++, a continual generative pre-training
framework. Empirical results demonstrate the effectiveness and efficiency of
CorpusBrain++ compared with both traditional and generative IR methods.