Measuring the Robustness of NLP Models to Domain Shifts
CoRR (2023)
Abstract
Existing research on Domain Robustness (DR) suffers from disparate setups,
lack of task variety, and scarce research on recent models and capabilities
such as few-shot learning. Furthermore, we claim that the common practice of
measuring DR might further obscure the picture. Current research focuses on
challenge sets and relies solely on the Source Drop (SD): using the source
in-domain performance as a reference point for degradation. However, the Target
Drop (TD) should be used as a complementary point of view. To understand the DR
challenge in modern NLP models, we developed a benchmark comprising seven NLP
tasks, including classification, QA, and generation. Our benchmark focuses on
natural topical domain shifts and enables measuring both the SD and the TD. Our
comprehensive study, involving over 14,000 domain shifts across 18 fine-tuned
and few-shot models, shows that both model types suffer performance drops upon domain
shifts. While fine-tuned models excel in-domain, few-shot LLMs often surpass
them cross-domain, showing better robustness. In addition, we found that a
large SD can be explained by shifting to a harder domain rather than a genuine
DR challenge. Thus, the TD is a more reliable metric.
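
The difference between the two metrics can be illustrated with a minimal sketch. It assumes that the SD is the source in-domain score minus the cross-domain score, and that the TD is defined analogously with the target in-domain score as the reference point; the abstract states only the SD reference explicitly, so the TD formula here is an assumption. The function names, the domain labels, and the scores are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
# Hypothetical scores indexed by (training domain, test domain).
# These numbers are made up for illustration only.
scores = {
    ("A", "A"): 90.0,  # source in-domain: train on A, test on A
    ("B", "B"): 70.0,  # target in-domain: train on B, test on B
    ("A", "B"): 68.0,  # cross-domain: train on A, test on B
}


def source_drop(scores, source, target):
    """SD: degradation relative to the source in-domain score."""
    return scores[(source, source)] - scores[(source, target)]


def target_drop(scores, source, target):
    """TD (assumed definition): degradation relative to the target in-domain score."""
    return scores[(target, target)] - scores[(source, target)]


sd = source_drop(scores, "A", "B")
td = target_drop(scores, "A", "B")
print(f"SD = {sd:.1f}, TD = {td:.1f}")
```

In this made-up case the SD is large mostly because domain B is harder even for a model trained on it, while the TD stays small. This is the kind of situation in which a large SD reflects a harder target domain rather than a genuine robustness problem.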