Diversity Measurement and Subset Selection for Instruction Tuning Datasets
CoRR (2024)
Abstract
We aim to select data subsets for fine-tuning large language models so that
they follow instructions more effectively. Prior work has emphasized the importance
of diversity in dataset curation but relied on heuristics such as the number of
tasks. In this paper, we use determinantal point processes to capture the
diversity and quality of instruction tuning datasets for subset selection. We
propose to measure dataset diversity with the log determinant distance, defined
as the distance between the dataset of interest and a maximally diverse reference
dataset. Our experiments demonstrate that the proposed diversity measure, computed
in the normalized weight gradient space, is correlated with downstream
instruction-following performance. Consequently, it can be used to indicate when
data selection is most helpful and to analyze dataset curation strategies.
We demonstrate the utility of our approach on various instruction tuning
datasets.