DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI
Conference of the European Chapter of the Association for Computational Linguistics (2023)
Abstract
Despite advancements in conversational AI, language models still struggle
to handle diverse conversational tasks, and existing dialogue
dataset collections often lack diversity and comprehensiveness. To tackle these
issues, we introduce DialogStudio: the largest and most diverse collection of
dialogue datasets, unified under a consistent format while preserving their
original information. Our collection encompasses data from open-domain
dialogues, task-oriented dialogues, natural language understanding,
conversational recommendation, dialogue summarization, and knowledge-grounded
dialogues, making it an incredibly rich and diverse resource for dialogue
research and model training. To further enhance the utility of DialogStudio, we
identify the licenses for each dataset and design external knowledge and
domain-aware prompts for selected dialogues to facilitate instruction-aware
fine-tuning. Furthermore, we develop conversational AI models using the dataset
collection, and our experiments in both zero-shot and few-shot learning
scenarios demonstrate the superiority of DialogStudio. To improve transparency
and support dataset and task-based research, as well as language model
pre-training, all datasets, licenses, code, and models associated with
DialogStudio are made publicly
accessible.