Accessible Russian Large Language Models: Open-Source Models and Instructive Datasets for Commercial Applications

D. P. Kosenko, Yu. M. Kuratov, D. R. Zharikova

Doklady Mathematics (2024)

Abstract
This paper presents an approach to developing and fine-tuning large language models for the Russian language that are capable of following instructions across domains. As base models, XGLM-4.5B, LLaMA-1 7B, LLaMA-1 13B, LLaMA-2 7B, LLaMA-2 13B, and ruGPT-3.5 13B are used. The work compares two main fine-tuning techniques: fine-tuning all model parameters and fine-tuning with LoRA layers. To create a fine-tuning dataset, several open English-language data sources are used, including Databricks Dolly 15k, the OpenAssistant Conversations Dataset (OASST1), and chip2-instruct-alpha-v6a-1, which are then translated into Russian using the WMT21 En-X model. The work shows that the quality of the instructions used for training significantly affects task-solving ability as measured by automatic benchmarks such as MT-Bench and MMLU. At the same time, models trained on the commercially licensed dataset collected in this work achieve results comparable to models fine-tuned on the Saiga dataset, which has a restrictive license. The fine-tuned language models and the collected Russian-language datasets are released open-source under licenses suitable for commercial use.
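The abstract names two concrete technical steps: translating English instruction data into Russian with the WMT21 En-X model, and fine-tuning base models either fully or through LoRA layers. The two sketches below illustrate how such steps are commonly implemented; they are assumptions about the pipeline, not the authors' exact code. The first assumes the publicly available facebook/wmt21-dense-24-wide-en-x checkpoint on Hugging Face; the generation settings and example sentence are illustrative.

```python
# Sketch: translating English instruction data into Russian with a WMT21 En-X
# checkpoint (assumes facebook/wmt21-dense-24-wide-en-x; generation settings
# are illustrative and not taken from the paper).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/wmt21-dense-24-wide-en-x"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_en_to_ru(texts):
    """Translate a batch of English strings into Russian."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id("ru"),  # target language: Russian
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_en_to_ru(["Explain what a neural network is."]))
```

For the LoRA variant of fine-tuning, a minimal sketch using the PEFT library is shown below; the rank, alpha, dropout, and target modules are assumed hyperparameters for illustration and are not reported in the abstract.

```python
# Sketch: parameter-efficient fine-tuning with LoRA adapters via the PEFT
# library (hyperparameters and target modules are illustrative assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                 # low-rank dimension (assumed)
    lora_alpha=32,        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights remain trainable
# `model` can then be trained with a standard Trainer / training loop on the
# translated Russian instruction dataset.
```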
Keywords
large language models, language models, language models in Russian