IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

Abstract
Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.
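The deduplication step of such a curation pipeline can be sketched in miniature. This is a simplified illustration only, not the released code: `normalize` and `deduplicate` are hypothetical helpers, and a production pipeline would typically also use near-duplicate detection (e.g. MinHash) rather than exact hashing alone.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(documents):
    # Keep the first occurrence of each normalized document; drop exact repeats.
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello   world", "hello world", "A different document"]
print(deduplicate(docs))  # only the first copy of each distinct document survives
```

Hashing the normalized text keeps memory usage bounded by the number of distinct documents, which matters at corpus scale.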