Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams
arXiv (2024)
Abstract
The proliferation of textual data on the Internet presents a unique
opportunity for institutions and companies to monitor public opinion about
their services and products. Given the rapid generation of such data, the text
stream mining setting, which handles sequentially arriving, potentially
infinite text streams, is often more suitable than traditional batch learning.
While pre-trained language models are commonly employed for their high-quality
text vectorization capabilities in streaming contexts, they face challenges
adapting to concept drift, the phenomenon in which the data distribution changes
over time and adversely affects model performance. To address
concept drift, this study explores the efficacy of seven text sampling methods
designed to selectively fine-tune language models, thereby mitigating
performance degradation. We systematically assess the impact of these methods on
fine-tuning the SBERT model with four different loss functions. Our
evaluation, focused on macro F1-score and elapsed time, employs two text stream
datasets and an incremental SVM classifier to benchmark performance. Our
findings indicate that Softmax loss and Batch All Triplets loss are
particularly effective for text stream classification, and that
larger sample sizes generally correlate with improved macro F1-scores. Notably,
our proposed WordPieceToken ratio sampling method significantly enhances
performance with the identified loss functions, surpassing baseline results.
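
The abstract describes this pipeline but gives no implementation details, so the following is a minimal sketch assuming the public sentence-transformers and scikit-learn APIs. The checkpoint name, buffer and sample sizes, and the wordpiece_token_ratio helper are illustrative assumptions; in particular, scoring texts by WordPiece subtokens per whitespace word is only one plausible reading of the WordPieceToken ratio criterion, which the abstract does not define.

    # A minimal sketch of the abstract's pipeline (names and sizes are assumptions).
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from sklearn.linear_model import SGDClassifier
    from torch.utils.data import DataLoader

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint
    clf = SGDClassifier(loss="hinge")                # incremental linear SVM

    def wordpiece_token_ratio(text):
        # Hypothetical scoring: WordPiece subtokens per whitespace word. Texts
        # the tokenizer fragments heavily may signal vocabulary drift.
        subtokens = model.tokenizer.tokenize(text)
        return len(subtokens) / max(len(text.split()), 1)

    def fine_tune(sample, epochs=1):
        # Selectively fine-tune SBERT with Batch All Triplets loss, one of the
        # two objectives the study found effective; it mines all valid triplets
        # from the class labels within each batch.
        examples = [InputExample(texts=[t], label=int(y)) for t, y in sample]
        loader = DataLoader(examples, shuffle=True, batch_size=16)
        triplet_loss = losses.BatchAllTripletLoss(model=model)
        model.fit(train_objectives=[(loader, triplet_loss)],
                  epochs=epochs, show_progress_bar=False)

    def process_stream(stream, classes, buffer_size=256, sample_size=64):
        buffer = []
        for text, label in stream:
            emb = model.encode([text])
            if hasattr(clf, "coef_"):
                _ = clf.predict(emb)  # test-then-train (prequential) evaluation
            clf.partial_fit(emb, [label], classes=classes)
            buffer.append((text, label))
            if len(buffer) >= buffer_size:
                # Keep the texts with the highest WordPieceToken ratio and use
                # them to refresh the encoder against concept drift.
                buffer.sort(key=lambda x: wordpiece_token_ratio(x[0]), reverse=True)
                fine_tune(buffer[:sample_size])
                buffer.clear()

The sketch uses Batch All Triplets loss because it operates directly on labeled single sentences; Softmax loss (losses.SoftmaxLoss), the other objective the study highlights, expects sentence pairs in sentence-transformers.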