In recent years, the emergence of Large Language Models (LLMs) has brought about significant shifts in the daily routines of consumers. Individuals can now undertake a diverse range of tasks, such as retrieving information, composing text, and refining documents through these powerful language tools. This integration of LLMs into daily life has resulted in notable boosts in productivity, both at work and in personal endeavors.
However, it’s important to recognize that not all consumers have experienced these benefits equally. Indeed, a considerable number of people around the world who speak less common languages aren’t able to interact with LLMs, primarily due to the inadequacy of language models designed for these specific languages. With 7,000 languages currently spoken in the world, the largest multilingual LLMs have been trained using only fewer than a hundred languages, thus leaving many languages and people completely behind.
Supporting non-English languages requires high-quality, abundant data sources, which can be difficult to find and access. And not only do those models perform worse but it also has been reported by
The performance of LLMs tailored for Low Resource Languages (LRL) is hindered by several key challenges.
Firstly, the foundation models for many LLMs rely on data scraped from the internet, which often lacks comprehensive coverage of LRLs. The below graph shows a distribution of data across the internet divided into language groups. While more common languages have hundreds of GB of data potentially available for training models, the languages in the tail of the graph only have data available in the range of hundreds of megabytes.
This limitation is further magnified by the absence of fine-tuned instruction datasets for many LRLs. An instruction dataset consists of a question set paired with ideal answers and is a crucial part of LLM training - in this case, in specific languages. This is how the model learns to follow instructions, and without this asset, models are only capable of predicting the next word in the sequence instead of assisting humans with complex questions and problem-solving tasks.
The above is caused by the fact that LLMs are trained in sequential steps. The first step is to learn the language by reading a large amount of unannotated text which gives the model the ability to predict the next world in the sequence. The second step is tailoring this predictive behavior to follow specific instructions, such as answering questions, writing summaries, or extracting data. This is why fine-tuning datasets is of such importance, as their quality will further determine the ability of LLM to assist users with required tasks.
In the following section, we will present a method to create a high-quality dataset for Swahili that can be used to fine-tune the LLM for this language. The method can be applied to any low-resource language.
Swahili is a language spoken by over 200 million people across 14 different African countries and is the official national language in Tanzania, Kenya, Uganda, and the Democratic Republic of the Congo. It belongs to the group of low-resource languages and is an example of a language that does not have an out-of-the-box instruction dataset for LLM fine-tuning.
In general, three approaches exist to create a fine-tuning dataset for a language. The first one is the direct generation of a dataset by assessors, in this case, language experts, which requires developing both questions and ideal answers in the desired language. This can be challenging for the Swahili language because assessors need to be high-level experts and the process is generally expensive.
Another potential solution is taking an existing instruction dataset in English and translating it into Swahili. This could be done by translators who speak both Swahili and English but this can also be time and resource intensive. An automatic translator could be used, however, this typically results in insufficient or poor-quality results.
Another solution combines automated translation with human validation, offering a cost-efficient and scalable approach, which is critical to ensure LRL models are accurate, reflect local customs and norms, and are useful to the communities that will be using them. This method utilizes the best available automatic translator from Swahili to English and then asks native Swahili speakers to filter out examples that do not meet quality standards.
Toloka recently undertook a development project, where they created an 11,000 fine-tuning dataset for Swahili from the 15,000 original
The dataset was then used to improve
As developers and organizations strive to create a more inclusive AI ecosystem, evaluation becomes even more critical, as does human involvement in training LLMs. Cohere’s recent launch of