Accessible Russian Large Language Models: Open-Source Models and Instructive Datasets for Commercial Applications

D. P. Kosenko, Yu. M. Kuratov, D. R. Zharikova

Doklady Mathematics (2024)

Abstract
This paper presents an approach to developing and fine-tuning large language models for the Russian language that are capable of following instructions across domains. As base models, XGLM-4.5B, LLaMA-1 7B, LLaMA-1 13B, LLaMA-2 7B, LLaMA-2 13B, and ruGPT-3.5 13B are used. The work compares two main fine-tuning techniques: fine-tuning all model parameters and fine-tuning with LoRA layers. To create a fine-tuning dataset, several open English-language data sources are used, including Databricks Dolly 15k, the OpenAssistant Conversations Dataset (OASST1), and chip2-instruct-alpha-v6a-1, which are then translated into Russian using the WMT21 En-X model. The work shows that the quality of the instructions used for training significantly affects task-solving ability as measured by automatic benchmarks such as MT-Bench and MMLU. At the same time, models trained on the commercially licensed dataset collected in this work achieve results comparable to models fine-tuned on the Saiga dataset, which has a restrictive license. The fine-tuned language models and the collected Russian-language datasets are released open-source under licenses suitable for commercial use.
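The abstract names two concrete technical steps: translating English instruction data into Russian with the WMT21 En-X model, and fine-tuning base models either fully or through LoRA layers. The two sketches below illustrate how such steps are commonly implemented; they are assumptions about the pipeline, not the authors' exact code. The first assumes the publicly available facebook/wmt21-dense-24-wide-en-x checkpoint on Hugging Face; the generation settings and example sentence are illustrative.

```python
# Sketch: translating English instruction data into Russian with a WMT21 En-X
# checkpoint (assumes facebook/wmt21-dense-24-wide-en-x; generation settings
# are illustrative and not taken from the paper).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/wmt21-dense-24-wide-en-x"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_en_to_ru(texts):
    """Translate a batch of English strings into Russian."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.get_lang_id("ru"),  # target language: Russian
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_en_to_ru(["Explain what a neural network is."]))
```

For the LoRA variant of fine-tuning, a minimal sketch using the PEFT library is shown below; the rank, alpha, dropout, and target modules are assumed hyperparameters for illustration and are not reported in the abstract.

```python
# Sketch: parameter-efficient fine-tuning with LoRA adapters via the PEFT
# library (hyperparameters and target modules are illustrative assumptions).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                 # low-rank dimension (assumed)
    lora_alpha=32,        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights remain trainable
# `model` can then be trained with a standard Trainer / training loop on the
# translated Russian instruction dataset.
```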
Keywords
large language models, language models, language models in Russian